Pandas: Groupby plot is not working for object columns

I am new to Pandas and am analysing a CSV file. I have read the CSV successfully and can display its contents. Two of the columns have dtype object, and those are the ones I need to plot. I have grouped by those two columns and can get the first row of each group as well as all the data, but I am not sure how to plot object-typed columns in Pandas. Below is my groupby code for event_type and event_description, the two columns I need to plot. Being able to plot the Application and Network values of event_type would be a great help.
import pandas as pd
data = pd.read_csv('/Users/temp/Downloads/sample.csv')
data.head()
grouped_df = data.groupby([ "event_type", "event_description"])
grouped_df.first()

As commented, more information is needed, but if I understand correctly, try:
data['event_type'].value_counts(sort=True).plot(kind='barh')
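To bring event_description into the same chart, one option is to count rows per (event_type, event_description) pair and pivot the result before plotting. The sketch below uses a small made-up frame in place of sample.csv, so the values are assumptions; only the two column names come from the question. The helper plot_counts is a hypothetical name introduced for illustration.

```python
import pandas as pd

# Hypothetical stand-in for sample.csv; only the column names are from the question.
data = pd.DataFrame({
    "event_type": ["Application", "Network", "Application",
                   "System", "Network", "Application"],
    "event_description": ["crash", "timeout", "crash",
                          "reboot", "timeout", "slow"],
})

# Keep only the two event types of interest.
subset = data[data["event_type"].isin(["Application", "Network"])]

# Count rows per (event_type, event_description) pair, then move the
# descriptions into columns so each event type becomes one group of bars.
counts = (
    subset.groupby(["event_type", "event_description"])
    .size()
    .unstack(fill_value=0)
)


def plot_counts(counts):
    """Draw one group of bars per event type (needs matplotlib installed)."""
    ax = counts.plot(kind="bar")
    ax.set_ylabel("number of events")
    return ax
```

In a notebook, calling plot_counts(counts) should then show grouped bars for Application and Network. Object columns cannot be plotted directly, which is why the counting step comes first.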

Related

Save output in CSV without losing previous data on that CSV in pandas dataframe

I'm doing sentiment analysis of Twitter data. For this work, I've made several datasets in CSV format, one dataset per month. After preprocessing each dataset individually, I want to save all of them in a single CSV file. But when I write the data with the code below, using a pandas DataFrame:
df.to_csv('dataset.csv', index=False)
It removes the previous data (rows) in that file. Is there any way to keep the previous data in the file as well, so that I can merge all the data together? Thank you.
It's not entirely clear what you want from your question, so this is just a guess, but something like this might be what you're looking for. If you keep assigning dataframes to df, the new data will overwrite the old data. Try assigning them to differently named dataframes such as df1 and df2. Then you can merge them.
# vertically merge the multiple dataframes and reassign to new variable
df = pd.concat([df1, df2])
# save the dataframe
df.to_csv('my_dataset.csv', index=False)
In Python you can call the open() function with the mode parameter 'a':
open("file", 'a')
The 'a' means "append", so new lines are added at the end of the file.
You can pass the same mode to the pandas.DataFrame.to_csv() method.
e.g.:
import pandas as pd
# code where you get data and return df
df.to_csv("file", mode='a')
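One detail worth watching with mode='a' is the header row: by default to_csv writes the column names on every call, so they would be repeated in the middle of the file. A minimal sketch of one way around that, assuming made-up monthly frames and a temporary file path (both hypothetical, not from the question):

```python
import os
import tempfile

import pandas as pd

# Hypothetical monthly frames standing in for the preprocessed datasets.
january = pd.DataFrame({"tweet": ["a", "b"], "sentiment": [1, 0]})
february = pd.DataFrame({"tweet": ["c"], "sentiment": [1]})

# Temporary location so the sketch does not touch any real file.
path = os.path.join(tempfile.mkdtemp(), "dataset.csv")

for monthly_df in (january, february):
    # Write the header only if the file does not exist yet, so appended
    # chunks do not repeat the column names.
    monthly_df.to_csv(
        path, mode="a", index=False, header=not os.path.exists(path)
    )

# Reading the file back gives all rows from both months.
combined = pd.read_csv(path)
```

This keeps each month's rows on disk without holding all the dataframes in memory at once, which is the main difference from the pd.concat approach.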
@thehand0: Your code works, but it's inefficient, so it will take longer for your script to run.

HoloViews: create boxplots for every column in a pandas dataframe

I'm able to create the following boxplot with Pandas pandas.DataFrame.boxplot() method:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1234)
df = pd.DataFrame(np.random.rand(10, 4),
columns=['Col1', 'Col2', 'Col3', 'Col4'])
df.plot.box()
plt.show()
However, when I try to do the same using HoloViews' BoxWhisker Element with Bokeh as the backend, it only works for a single column:
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')
hv.BoxWhisker(
data=df['Col1'],
vdims='Col1'
)
But as soon as I try to add just another column, I get the below error:
hv.BoxWhisker(
data=df[['Col1', 'Col2']]
)
DataError: None of the available storage backends were able to support the supplied data format. PandasInterface raised following error:
unsupported operand type(s) for +: 'NoneType' and 'int'
PandasInterface expects tabular data, for more information on supported datatypes see http://holoviews.org/user_guide/Tabular_Datasets.html
I can't tell whether the problem is with the tabular data format HoloViews expects, or whether I'm not applying the syntax properly.
I would also recommend James Bednar's answer, which uses hvPlot. hvPlot is built on top of HoloViews:
import hvplot.pandas
df.hvplot.box()
However, if you want to do it in HoloViews instead of hvPlot, you would have to melt your data to get all column names in one column, and all values in the other column.
This code works for your sample data:
hv.BoxWhisker(df.melt(), kdims='variable', vdims='value')
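To see what melt() hands to BoxWhisker, it can help to inspect the reshaped frame on its own. The sketch below rebuilds the sample data from the question and only uses pandas; the name long_df is introduced here for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(1234)
df = pd.DataFrame(np.random.rand(10, 4),
                  columns=['Col1', 'Col2', 'Col3', 'Col4'])

# melt() stacks the four columns into two: 'variable' holds the original
# column name and 'value' holds the number, giving one row per observation.
long_df = df.melt()
```

The 10 x 4 frame becomes 40 rows with the columns 'variable' and 'value', which is why those two names are used as kdims and vdims.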
I'm not sure how to achieve what you want from the native HoloViews BoxWhisker interface, which is set up for tidy data rather than a set of independent columns like that. Meanwhile, you can use hvPlot just as you would use the native .plot() call, i.e. df.hvplot.box() after import hvplot.pandas.

Holoviz panel will not print pandas dataframe row in Jupyter notebook

I'm trying to recreate the first panel.interact example in the Holoviz tutorial using a Pandas dataframe instead of a Dask dataframe. I get the slider, but the pandas dataframe row does not show.
See the original example at: http://holoviz.org/tutorial/Building_Panels.html
I've tried using Dask as in the Holoviz example. Dask rows print out just fine, which suggests that panel treats Dask dataframe rows differently from Pandas dataframe rows when printing. Here's my minimal code:
import pandas as pd
import panel
l1 = ['a','b','c','d','a','b']
l2 = [1,2,3,4,5,6]
df = pd.DataFrame({'cat':l1,'val':l2})
def select_row(rowno=0):
    row = df.loc[rowno]
    return row
panel.extension()
panel.extension('katex')
panel.interact(select_row, rowno=(0, 5))
I've included a line with the katex extension, because without it, I get a warning that it is needed. Without it, I don't even get the slider.
I can call the select_row(rowno=0) function separately in a Jupyter cell and get a nice printout of the row, so it appears the function is working as it should.
Any help in getting this to work would be most appreciated. Thanks.
Got a solution. With Pandas, loc[rowno:rowno] returns a pandas.core.frame.DataFrame object of length 1 which works fine with panel while loc[rowno] returns a pandas.core.series.Series object which does not work so well. Thus modifying the select_row() function like this makes it all work:
def select_row(rowno=0):
    row = df.loc[rowno:rowno]
    return row
Still not sure, however, why panel will print out the Dataframe object and not the Series object.
Note: if you use iloc, the end of the slice is exclusive, so you need to add 1, i.e., df.iloc[rowno:rowno+1].
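The difference between the two indexing forms can be checked without panel at all. A small sketch using the dataframe from the question (the variable names as_series, as_frame and as_frame_iloc are introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame({'cat': ['a', 'b', 'c', 'd', 'a', 'b'],
                   'val': [1, 2, 3, 4, 5, 6]})

rowno = 2

# A single label returns a Series.
as_series = df.loc[rowno]

# A label slice includes its end point, so this is a one-row DataFrame.
as_frame = df.loc[rowno:rowno]

# A positional slice excludes its end point, hence the +1.
as_frame_iloc = df.iloc[rowno:rowno + 1]
```

Both slice forms select the same single row here because the index is the default 0..5 range; with a non-default index, loc and iloc would no longer line up.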

Pandas Timeseries plotting

I have a Pandas timeseries object with dates and corresponding values. But when I try to plot it, the plot is L-shaped (the dates and values seem to be arranged automatically so that the highest value comes first...).
This is what I did to generate the plot:
df = pd.read_csv(r'C:\data\test1.csv') # two-column dataframe; raw string so that \t is not read as a tab
data_list = df['values'].tolist()
dates_list = df['date'].tolist()
df_ts = pd.Series(data_list, index=dates_list)
df_ts.plot()
I am not sure where I am making a mistake. I am reading in a CSV file, converting it to a timeseries object and plotting it. Any suggestions are very much appreciated.
Thanks!
PD
Don't bother creating the unnecessary intermediate data structures; just organize your DataFrame better.
df['date'] = pd.to_datetime(df['date']) # make sure you're actually dealing with timestamps
df.set_index('date', inplace=True)
df.sort_index(inplace=True) # df.sort() was removed in later pandas versions
df.plot()
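The effect of these steps can be seen on a tiny frame with out-of-order dates. This is a sketch with made-up values standing in for test1.csv; only the column names 'date' and 'values' come from the question.

```python
import pandas as pd

# Hypothetical stand-in for test1.csv, with rows out of chronological order.
df = pd.DataFrame({
    "date": ["2014-03-01", "2014-01-01", "2014-02-01"],
    "values": [30, 10, 20],
})

# Strings become real timestamps, so they sort by time rather than by text.
df["date"] = pd.to_datetime(df["date"])

# Index by date and put the rows into chronological order for plotting.
df = df.set_index("date").sort_index()
```

After this, df.plot() draws the points from left to right in time order instead of following the order of the rows in the file.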

How to generate a pandas dataframe from ordereddict?

How can I generate a pandas dataframe from an OrderedDict?
I have tried using the DataFrame.from_dict method, but it does not give me the expected dataframe.
What is the best approach to convert an ordereddict into a list of dicts?
A bug in Pandas did not respect the key ordering of OrderedDict objects converted to a DataFrame via the from_dict call. Fixed in Pandas 0.11.
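On a current pandas release, a quick check along these lines should show the key order being kept. The data is made up, with keys deliberately out of alphabetical order, and the last line covers the "list of dicts" part of the question.

```python
from collections import OrderedDict

import pandas as pd

# Hypothetical data; keys are deliberately not in alphabetical order.
od = OrderedDict([("zeta", [1, 2]), ("alpha", [3, 4]), ("mid", [5, 6])])

# Each key becomes a column, in insertion order.
df = pd.DataFrame.from_dict(od)

# One dict per row, if a list of dicts is what is needed.
records = df.to_dict("records")
```

If the keys should become rows instead of columns, from_dict also accepts orient="index".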