Trouble with plotly charts - pandas

I am giving myself an intro to plotting data and have come across some trouble. I am working on a line chart that I plan on making animated as soon as I figure out this problem.
I want a graph that looks like this:
However this code I have now:
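`import plotly.graph_objs as go
import plotly.offline as ply  # assumed alias; the original post does not show its imports
x = df_pre_2003['year']
y = df_pre_2003['nAllNeonic']
trace = go.Scatter(
    x=x,
    y=y
)
data = [trace]
ply.plot(data, filename='test.html')`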
is giving me this:
So I tried y=df_pre_2003['nAllNeonic'].sum() instead, but now I get a ValueError:
Invalid value of type 'builtins.float' received for the 'y' property of scatter
Received value: 1133180.4000000006
The 'y' property is an array that may be specified as a tuple,
list, numpy array, or pandas Series
I tried passing those types as well and it still did not work. The dtype of year is int64 and the dtype of nAllNeonic is float64.

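It looks like you have to sort the values by year first. Right now the line is drawn in row order, so it connects a value in the year 1997 with a value in 1994. Note that sort_values returns a new DataFrame rather than sorting in place, so assign the result back:
df_pre_2003 = df_pre_2003.sort_values(by=['year'])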
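The attempt with .sum() suggests the goal may be one total per year rather than one point per row. That is an assumption about the intent, not something the question states. A minimal sketch with made-up numbers standing in for df_pre_2003:

```python
import pandas as pd

# Hypothetical stand-in for df_pre_2003; the real data is not shown in the question.
df_pre_2003 = pd.DataFrame({
    'year': [1997, 1994, 1995, 1994, 1997],
    'nAllNeonic': [10.0, 1.5, 3.0, 2.5, 5.0],
})

# Option 1: keep every row, but order the rows by year so the line runs left to right.
sorted_df = df_pre_2003.sort_values(by='year')

# Option 2: collapse to one total per year. The result is still a Series per column,
# which is what the 'y' property of go.Scatter expects (a bare float is rejected).
yearly = df_pre_2003.groupby('year', as_index=False)['nAllNeonic'].sum()

x = yearly['year']
y = yearly['nAllNeonic']
```

Either sorted_df or yearly can then feed go.Scatter(x=..., y=...) as in the question.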

This does not answer the question directly, but I am sharing a similar case for anyone who lands here later:
In my case the error appeared when I passed Django model objects straight to the plotly scatter chart:
The 'x' property is an array that may be specified as a tuple, list, numpy array, or pandas Series
The solution in my case was to export the Django model data into a pandas DataFrame and then pass the DataFrame columns instead of the model field names.
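A rough sketch of that conversion. To keep it runnable without a Django project, a plain list of dicts stands in for what QuerySet.values() would return; the field names year and amount are hypothetical.

```python
import pandas as pd

# Stand-in for something like list(SomeModel.objects.values('year', 'amount')).
# SomeModel, 'year' and 'amount' are made-up names for illustration only.
records = [
    {'year': 2001, 'amount': 3.5},
    {'year': 2000, 'amount': 2.0},
]

# Build the DataFrame from the records and sort by the x column.
df = pd.DataFrame.from_records(records).sort_values('year')

# These are pandas Series, one of the types the plotly error message asks for.
x = df['year']
y = df['amount']
```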


Pandas dataframe rendered with bokeh shows no marks

I am attempting to create a simple hbar() chart on two columns [project, bug_count]. Sample dataframe follows:
df = pd.DataFrame({'project': ['project1', 'project2', 'project3', 'project4'],
'bug_count': [43683, 31647, 27494, 24845]})
When attempting to render any chart: scatter, circle, vbar etc... I get a blank chart.
This very simple code snippet shows an empty viz. This example shows a f.circle() just for demonstration, I'm actually trying to implement a f.hbar().
from bokeh.io import show, output_notebook
from bokeh.plotting import figure
f = figure()
f.circle(df['project'], df['bug_count'],size = 10)
show(f)
The values of df['project'] are strings, i.e. categorical values, not numbers. Categorical ranges must be provided explicitly, since you are the only person who knows what order the arbitrary factors should appear in on the axis. Something like
p = figure(x_range=sorted(set(df['project'])))
There is an entire chapter in the User's Guide devoted to Handling Categorical Data, with many complete examples (including many bar charts) that you can refer to.
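For the hbar() case from the question, the categories go on the y axis, so the factor list is passed as y_range instead. A minimal sketch; ordering the bars by bug_count is my own choice here, and the bokeh import is kept inside the function so the data preparation can be checked on its own:

```python
import pandas as pd

df = pd.DataFrame({'project': ['project1', 'project2', 'project3', 'project4'],
                   'bug_count': [43683, 31647, 27494, 24845]})

# Decide the order of the categories explicitly. Sorting ascending puts the
# largest bar at the top, because the first factor sits at the bottom of the axis.
factors = df.sort_values('bug_count')['project'].tolist()

def make_hbar(df, factors):
    """Build a horizontal bar chart with an explicit categorical y_range."""
    from bokeh.models import ColumnDataSource
    from bokeh.plotting import figure

    p = figure(y_range=factors)
    p.hbar(y='project', right='bug_count', height=0.8,
           source=ColumnDataSource(df))
    return p
```

In a notebook, show(make_hbar(df, factors)) after output_notebook() should then render the bars.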

Holoviz panel will not print pandas dataframe row in Jupyter notebook

I'm trying to recreate the first panel.interact example in the Holoviz tutorial using a Pandas dataframe instead of a Dask dataframe. I get the slider, but the pandas dataframe row does not show.
See the original example at: http://holoviz.org/tutorial/Building_Panels.html
I've tried using Dask as in the Holoviz example. Dask rows print out just fine, which suggests that panel treats Dask dataframe rows differently from Pandas dataframe rows when printing. Here's my minimal code:
import pandas as pd
import panel
l1 = ['a','b','c','d','a','b']
l2 = [1,2,3,4,5,6]
df = pd.DataFrame({'cat':l1,'val':l2})
def select_row(rowno=0):
    row = df.loc[rowno]
    return row
panel.extension()
panel.extension('katex')
panel.interact(select_row, rowno=(0, 5))
I've included a line with the katex extension, because without it, I get a warning that it is needed. Without it, I don't even get the slider.
I can call the select_row(rowno=0) function separately in a Jupyter cell and get a nice printout of the row, so it appears the function is working as it should.
Any help in getting this to work would be most appreciated. Thanks.
Got a solution. With Pandas, loc[rowno:rowno] returns a pandas.core.frame.DataFrame object of length 1 which works fine with panel while loc[rowno] returns a pandas.core.series.Series object which does not work so well. Thus modifying the select_row() function like this makes it all work:
def select_row(rowno=0):
    row = df.loc[rowno:rowno]
    return row
Still not sure, however, why panel will print out the Dataframe object and not the Series object.
Note: if you use iloc, the end of the slice is exclusive, so add 1, i.e., df.iloc[rowno:rowno+1].
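The difference between the selections is easy to check with plain pandas, no panel needed. A small sketch using the dataframe from the question; df.loc[[rowno]] is one more spelling that also gives a one-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'cat': ['a', 'b', 'c', 'd', 'a', 'b'],
                   'val': [1, 2, 3, 4, 5, 6]})
rowno = 2

# A scalar label returns the row as a Series.
as_series = df.loc[rowno]

# A label slice includes both endpoints, so rowno:rowno is exactly one row.
by_label_slice = df.loc[rowno:rowno]

# A positional slice excludes the end, hence the +1.
by_position_slice = df.iloc[rowno:rowno + 1]

# A one-element list of labels also keeps the DataFrame shape.
by_list = df.loc[[rowno]]

print(type(as_series).__name__, type(by_label_slice).__name__)
```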

Pandas- Groupby Plot is not working for object

I am new to Pandas and am analysing a CSV file. I have successfully read the CSV and displayed its contents. Two of the columns have object dtype, and those are the ones I need to plot. I have done a groupby on those two columns and can get the first row of each group as well as all the data, but I am not sure how to plot these object-typed columns in Pandas. Below is my groupby code; event_type and event_description are the columns I need to plot. If I can plot Application and Network for event_type, that would be a great help.
import pandas as pd
data = pd.read_csv('/Users/temp/Downloads/sample.csv')
data.head()
grouped_df = data.groupby([ "event_type", "event_description"])
grouped_df.first()
As commented, more info is needed, but IIUC, try (using the data frame from the question):
data['event_type'].value_counts(sort=True).plot(kind='barh')
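Object columns cannot be plotted directly, so the usual move is to count them first and plot the counts. A sketch with invented rows, since the real CSV is not shown; the event_description values are made up:

```python
import pandas as pd

# Hypothetical sample standing in for sample.csv.
data = pd.DataFrame({
    'event_type': ['Application', 'Network', 'Application', 'Network', 'Application'],
    'event_description': ['crash', 'timeout', 'crash', 'dns failure', 'slow start'],
})

# Keep only the two event types of interest.
wanted = data[data['event_type'].isin(['Application', 'Network'])]

# Number of rows per event_type: this is a numeric Series and can be plotted.
type_counts = wanted['event_type'].value_counts()

# Number of rows per (event_type, event_description) pair.
pair_counts = wanted.groupby(['event_type', 'event_description']).size()

# With matplotlib installed, either of these draws a bar chart:
#   type_counts.plot(kind='barh')
#   pair_counts.unstack(fill_value=0).plot(kind='barh', stacked=True)
```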

How to Box Plot Panda Timestamp series ? (Errors with Timestamp type)

I'm using:
Pandas version 0.23.0
Python version 3.6.5
Seaborn version 0.81.1
I'd like a Box Plot of a column of Timestamp data. My dataframe is not a time series, the index is just an integer but I have created a column of Timestamp data using:
# create a new column of time stamps corresponding to EVENT_DTM
data['EVENT_DTM_TS'] =pd.to_datetime(data.EVENT_DTM, errors='coerce')
I filter out all NaT values resulting from coerce.
dt_filtered_time = data[~data.EVENT_DTM_TS.isnull()]
At this point my data looks good and I can confirm the type of the EVENT_DTM_TS column is Timestamp with no invalid values.
Finally to generate the single variable box plot I invoke:
ax = sns.boxplot(x=dt_filtered_time.EVENT_DTM_TS)
and get the error:
TypeError: ufunc add cannot use operands with types dtype('M8[ns]') and dtype( 'M8[ns]')
I've Googled and found:
https://github.com/pandas-dev/pandas/issues/13844
https://github.com/matplotlib/matplotlib/issues/9610
which seemingly indicate issues with data type representations.
I've also seen references to issues with pandas version 0.21.0.
Does anyone have an easy fix, or do I need to use a different data type for the box plot? I'd like a single picture of the distribution of the timestamp data.
This is the code I ended up with:
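import time
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Tick formatter: turn seconds since the epoch into a 'YYYY-MM' label
def convert_to_date_string(x, pos):
    return time.strftime('%Y-%m', time.localtime(x))
plt.figure(figsize=(15, 4))
sns.set(style='whitegrid')
# Convert the nanosecond timestamps to float seconds so the boxplot can do arithmetic on them
temp = dt_filtered_time.EVENT_DTM_TS.astype(np.int64) / 1E9
ax = sns.boxplot(x=temp)
# Wrap the function in FuncFormatter so this also works on older matplotlib versions
ax.xaxis.set_major_formatter(plt.FuncFormatter(convert_to_date_string))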
Here is the result:
Credit goes to ImportanceOfBeingErnest whose comment pointed me towards this solution.
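The core of the workaround is the round trip between timestamps and plain numbers, which can be checked without any plotting. A sketch with three made-up dates; it uses subtraction from the epoch instead of astype(np.int64), which I believe is less sensitive to the stored datetime resolution:

```python
import pandas as pd

# Hypothetical timestamps standing in for the EVENT_DTM_TS column.
ts = pd.Series(pd.to_datetime(['2018-01-01', '2018-03-01', '2018-05-01']))

# Timestamps -> float seconds since the Unix epoch.
seconds = (ts - pd.Timestamp('1970-01-01')) / pd.Timedelta(seconds=1)

# Statistics a boxplot needs (median, quartiles) now work on plain floats.
median_seconds = seconds.median()

# Floats -> timestamps again, e.g. for labelling.
median_ts = pd.to_datetime(median_seconds, unit='s')
back = pd.to_datetime(seconds, unit='s')

print(median_ts)
```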

Why does DataFrameGroupBy.boxplot method throw error when given argument "subplots=True/False"?

I can use DataFrameGroupBy.boxplot(...) to create a boxplot in the following way:
In [15]: df = pd.DataFrame({"gene_length":[100,100,100,200,200,200,300,300,300],
...: "gene_id":[1,1,1,2,2,2,3,3,3],
...: "density":[0.4,1.1,1.2,1.9,2.0,2.5,2.2,3.0,3.3],
...: "cohort":["USA","EUR","FIJ","USA","EUR","FIJ","USA","EUR","FIJ"]})
In [17]: df.groupby("cohort").boxplot(column="density",by="gene_id")
In [18]: plt.show()
This produces the following image:
This is exactly what I want, except instead of making three subplots, I want all the plots to be in one plot (with different colors for USA, EUR, and FIJ). I've tried
In [17]: df.groupby("cohort").boxplot(column="density",subplots=False,by="gene_id")
but it produces the error
KeyError: 'gene_id'
I think the problem has something to do with the fact that by="gene_id" is a keyword sent to the matplotlib boxplot method. If someone has a better way of producing the plot I am after, perhaps by using DataFrame.boxplot(?) instead, please respond here. Thanks so much!
To use the pure pandas functions, I think you should not GroupBy before calling boxplot, but instead, request to group by certain columns in the call to boxplot on the DataFrame itself:
df.boxplot(column='density',by=['gene_id','cohort'])
To get a better-looking result, you might want to consider using the Seaborn library. It is designed to help with precisely this sort of task:
sns.boxplot(data=df,x='gene_id',y='density',hue='cohort')
EDIT to take into account comment below
If you want the cohort boxplots stacked/superimposed for each gene_id, it's a bit more complicated (and you might end up with quite an ugly output). You cannot do this using Seaborn, AFAIK, but you can with pandas directly, by passing the positions= parameter to boxplot. The catch is generating the correct sequence of positions to place the boxplots where you want them, and you'll have to fix the tick labels and the legend yourself.
pos = [i for i in range(len(df.gene_id.unique())) for _ in range(len(df.cohort.unique()))]
df.boxplot(column='density',by=['gene_id','cohort'],positions=pos)
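To see what that list comprehension produces, here is a sketch that pairs each (gene_id, cohort) group with its position, using the dataframe from the question. It only inspects the values and does not draw anything:

```python
import pandas as pd

df = pd.DataFrame({"gene_length": [100, 100, 100, 200, 200, 200, 300, 300, 300],
                   "gene_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "density": [0.4, 1.1, 1.2, 1.9, 2.0, 2.5, 2.2, 3.0, 3.3],
                   "cohort": ["USA", "EUR", "FIJ", "USA", "EUR", "FIJ", "USA", "EUR", "FIJ"]})

n_genes = df['gene_id'].nunique()
n_cohorts = df['cohort'].nunique()

# One position per gene, repeated once per cohort, so all cohorts of a gene share an x slot.
pos = [i for i in range(n_genes) for _ in range(n_cohorts)]

# The groups in sorted (gene_id, cohort) order.
groups = df.groupby(['gene_id', 'cohort']).size().index.tolist()

for group, position in zip(groups, pos):
    print(group, '->', position)
```

This assumes every gene_id appears with every cohort; if some combinations are missing, pos will have more entries than there are boxes.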
An alternative would be to use seaborn.swarmplot instead of boxplot. A swarmplot draws every point instead of the summary representation of a boxplot, and with dodge=False (older seaborn releases called this parameter split) the points are colored by cohort but stacked on top of each other for each gene_id.
sns.swarmplot(data=df,x='gene_id',y='density',hue='cohort', dodge=False)
Without knowing the actual content of your dataframe (number of points per gene and per cohort, and how separate they are in each cohort), it's hard to say which solution would be the most appropriate.