I'm able to create the following boxplot with the pandas DataFrame.plot.box() method:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1234)
df = pd.DataFrame(np.random.rand(10, 4),
                  columns=['Col1', 'Col2', 'Col3', 'Col4'])
df.plot.box()
plt.show()
However, if I try to do the same using HoloViews' BoxWhisker element with the Bokeh backend, it works fine for a single column:
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')
hv.BoxWhisker(
    data=df['Col1'],
    vdims='Col1'
)
But as soon as I try to add another column, I get the error below:
hv.BoxWhisker(
    data=df[['Col1', 'Col2']]
)
DataError: None of the available storage backends were able to support the supplied data format. PandasInterface raised following error:
unsupported operand type(s) for +: 'NoneType' and 'int'
PandasInterface expects tabular data, for more information on supported datatypes see http://holoviews.org/user_guide/Tabular_Datasets.html
I can't tell whether there is a problem with the tabular data format HoloViews expects, or whether I'm not applying the syntax properly.
I would also recommend James Bednar's answer, which uses hvPlot. hvPlot is built on top of HoloViews:
import hvplot.pandas
df.hvplot.box()
However, if you want to do it in HoloViews instead of hvPlot, you have to melt your data so that all column names end up in one column and all values in another.
This code works for your sample data:
hv.BoxWhisker(df.melt(), kdims='variable', vdims='value')
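For reference, here's a minimal end-to-end sketch of the melt approach ('variable' and 'value' are the default column names produced by pandas' melt()):

import numpy as np
import pandas as pd
import holoviews as hv
hv.extension('bokeh')

np.random.seed(1234)
df = pd.DataFrame(np.random.rand(10, 4),
                  columns=['Col1', 'Col2', 'Col3', 'Col4'])

# melt() stacks the four columns into a tidy frame with a
# 'variable' column (original column name) and a 'value' column
hv.BoxWhisker(df.melt(), kdims='variable', vdims='value')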
James Bednar noted he wasn't sure how to achieve this through the native HoloViews BoxWhisker interface, which is set up for tidy data rather than a set of independent columns like that; meanwhile, you can use hvPlot just as you would use the native .plot() call shown above.
I have a partition column in my Hive-style partitioned parquet dataset (written by PyArrow from Pandas Dataframe) with an entry like "TYPE=3860877578". When trying to read from this dataset, I get an error:
ArrowInvalid: error parsing '3760212050' as scalar of type int32
This is the first partition key that won't fit into an int32 (there are smaller integer values in other partitions; I think the inference must be done on the first one encountered). It looks like it should be possible to override the inferred type (to int64 or even string) at the dataset level, but I can't figure out how to get there from here :-) So far I have been using the pandas read_parquet() interface, passing down filters, columns, etc. to PyArrow. I think I will need to use the PyArrow APIs directly, but I don't know where to start.
How can I tell PyArrow to treat this column as an int64 or string type instead of trying to infer the type?
Example of dataset partition values that causes this problem:
/mydataset/TYPE=12345/*.parquet
/mydataset/TYPE=3760212050/*.parquet
Code that reproduces the problem with Pandas 1.1.1 and PyArrow 1.0.1:
import pandas as pd
# pyarrow is available and used
df = pd.read_parquet("mydataset")
The issue can't be avoided by filtering out the problematic value, because all of the partition values appear to be parsed before filtering is applied:
import pandas as pd
# pyarrow is available and used
df = pd.read_parquet("mydataset", filters=[[('TYPE','=','12345')]])
Since my original post, I have figured out that I can do what I want with the PyArrow API directly, like this:
from pyarrow.dataset import HivePartitioning
from pyarrow.parquet import ParquetDataset
import pyarrow as pa
partitioning = HivePartitioning(pa.schema([("TYPE", pa.int64())]))

# any valid filter list works here, e.g.:
filters = [[('TYPE', '=', '12345')]]

df = ParquetDataset("mydataset",
                    filters=filters,
                    partitioning=partitioning,
                    use_legacy_dataset=False).read_pandas().to_pandas()
I'd like to be able to pass that information down through the pandas read_parquet() interface, but it doesn't appear to be possible at this time.
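For what it's worth, a roughly equivalent sketch using the pyarrow.dataset API directly (assuming PyArrow 1.0+); declaring the schema for the partition field is the key part:

import pyarrow as pa
import pyarrow.dataset as ds

# declare TYPE as int64 up front so large partition values parse correctly
partitioning = ds.partitioning(pa.schema([("TYPE", pa.int64())]), flavor="hive")
dataset = ds.dataset("mydataset", format="parquet", partitioning=partitioning)
df = dataset.to_table().to_pandas()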
I am new to pandas and doing some analysis on a CSV file. I have successfully read the CSV and displayed its contents. Two of the columns I need to plot are of object type. I have grouped by those two columns and can get the first rows and all the data, but I am not sure how to plot these object-type columns in pandas. Below is a sample of my groupby for event_type and event_description, which I need to plot. If I can plot the counts of Application and Network for event_type, that would be a great help.
import pandas as pd
data = pd.read_csv('/Users/temp/Downloads/sample.csv')
data.head()
grouped_df = data.groupby([ "event_type", "event_description"])
grouped_df.first()
As commented - need more info, but IIUC, try:
df['event_type'].value_counts(sort=True).plot(kind='barh')
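If you also want the breakdown by event_description within each event_type, a sketch along these lines might help (column names taken from the question; the stacked bar chart is just one choice):

import matplotlib.pyplot as plt

# count each event_description within each event_type,
# then pivot the descriptions into columns for a stacked bar chart
counts = (data.groupby(["event_type", "event_description"])
              .size()
              .unstack(fill_value=0))
counts.plot(kind="bar", stacked=True)
plt.tight_layout()
plt.show()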
I have a pandas dataframe with two columns of time series data. In my actual data, these columns are large enough that the render is unwieldy without datashader. I am attempting to compare events from these two timeseries. However, I need to be able to tell which data point is from which column. A simple functional example is below. How would I get columns A and B to use different color maps?
import numpy as np
import hvplot.pandas
import pandas as pd
A = np.random.randint(10, size=10000)
B = np.random.randint(30, size=10000)
d = {'A': A, 'B': B}
df = pd.DataFrame(d)
df.hvplot(kind='scatter', datashade=True, height=500, width=1000, dynspread=False)
You will have to use the count_cat aggregator, which counts each category separately; in the example above that would look like this:
import datashader as ds
df.hvplot(kind='scatter', aggregator=ds.count_cat('Variable'), datashade=True,
          height=500, width=1000)
The 'Variable' here corresponds to the default group_label that hvPlot assigns to the columns. If you provided a different group_label, you would have to update the aggregator to match. However, instead of supplying an aggregator explicitly, you can also use the by keyword:
df.hvplot(kind='scatter', by='Variable', datashade=True,
          height=500, width=1000)
Once hvPlot 0.3.1 is released, you'll also be able to supply an explicit cmap, e.g.:
df.hvplot(kind='scatter', by='Variable', datashade=True,
          height=500, width=1000, cmap={'A': 'red', 'B': 'blue'})
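If you need per-column colors before that release, a rough equivalent at the HoloViews level is to melt the data and pass a color_key to the datashade operation (a sketch; the reshaping assumes the A/B frame from the question):

import datashader as ds
import holoviews as hv
from holoviews.operation.datashader import datashade
hv.extension('bokeh')

# reshape to long form so each point carries its column name
tidy = df.reset_index().melt(id_vars='index', var_name='Variable', value_name='value')
tidy['Variable'] = tidy['Variable'].astype('category')  # count_cat needs a categorical column

scatter = hv.Scatter(tidy, kdims='index', vdims=['value', 'Variable'])
datashade(scatter, aggregator=ds.count_cat('Variable'),
          color_key={'A': 'red', 'B': 'blue'})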
I am looking for a way to do assertive programming on pandas DataFrame data, as assertr does in R.
Is there a convenient library for that?
Any advice is very welcome.
I don't know of analogous libraries that integrate specifically with Pandas, but assert is a built-in keyword in Python, which you could use to validate data at various points in your data pipeline.
The syntax is simply:
assert [condition]
If true, nothing happens. If false, an AssertionError is raised.
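For example (the optional second argument becomes the error message):

assert 1 + 1 == 2                      # passes silently
assert 2 + 2 == 5, "bad arithmetic"    # raises AssertionError: bad arithmetic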
To validate Pandas data you could write a statement like this:
import pandas as pd
import seaborn as sns
iris = sns.load_dataset('iris')
# throws an exception if there are negative values in the sepal_length column
assert (iris['sepal_length'] > 0).all()
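If you want something closer to assertr's chained checks, here is a minimal sketch built on DataFrame.pipe() (verify is a hypothetical helper, not a library function):

def verify(df, condition, message):
    # raise AssertionError with `message` if any row violates the condition
    assert condition(df).all(), message
    return df

checked = (iris
           .pipe(verify, lambda d: d['sepal_length'] > 0, "non-positive sepal_length")
           .pipe(verify, lambda d: d['species'].notna(), "missing species values"))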
I found an answer to my own question: engarde is exactly what I was looking for.
The math problem that I'm solving gives different analytical solutions in different scenarios, and I would like to summarize the results in a nice table. IPython Notebook renders the list nicely:
for example:
from pandas import DataFrame
from sympy import *
init_printing()
a, b, c, d = symbols('a b c d')
t = [[a/b, b/a], [c/d, d/c]]
t
However, when I summarize the answers into a table using DataFrame, the math cannot be rendered any more:
df = DataFrame(t, index=['Situation 1', 'Situation 2'], columns=['Answer1','Answer2'])
df
"print df.to_latex()" also gives the same result. I also tried "print(latex(t))" but it gives this after compiling in LaTex, which is alright, but I still need to manually convert it to a table:
How should I use DataFrame in order to render the math properly? Or is there any other way to export the math results into a table in LaTeX? Thanks!
Update: 01/25/14
Thanks again to @Jakob for solving the problem. It works perfectly for simple matrices, though there are still some minor problems for more complicated math expressions. But I guess, as @asmeurer said, perfection requires an update in IPython and Pandas.
Update: 01/26/14
If I render the result directly, i.e. just print the list, it works fine:
MathJax is currently not able to render tables, hence the most obvious approach (pure latex) does not work.
However, following the advice of @asmeurer, you should use an HTML table and render the cell content as LaTeX. In your case this can easily be achieved by the following intermediate step:
from sympy import latex
tl = list(map(lambda tc: '$' + latex(tc) + '$', t))  # wrap entries in $...$ for MathJax
df = DataFrame(tl, index=['Situation 1', 'Situation 2'], columns=['Answer'])
df
which gives:
Update:
In the case of two-dimensional data, the simple map function will not work directly. To cope with this situation, the NumPy shape, reshape and ravel functions can be used like:
import numpy as np
t = [[a/b, b/a], [a*a, b*b]]
tl = np.reshape(list(map(lambda tc: '$' + latex(tc) + '$', np.ravel(t))), np.shape(t))
df = DataFrame(tl, index=['Situation 1', 'Situation 2'], columns=['Answer 1','Answer 2'])
df
This gives:
Update 2:
Pandas crops cell content if the string length exceeds a certain number. E.g., a more complicated expression like
t1 = [a/2+b/2+c/2+d/2]
tl = np.reshape(list(map(lambda tc: '$' + latex(tc) + '$', np.ravel(t1))), np.shape(t1))
df = DataFrame(tl, index=['Situation 1'], columns=['Answer 1'])
df
gives:
To cope with this issue, a pandas display option has to be altered; for details see here. For the present case, max_colwidth has to be changed. The default value is 50, hence let's change it to 100:
import pandas as pd
pd.options.display.max_colwidth = 100
df
gives: