Consider my series as below: First column is article_id and the second column is frequency count.
article_id
1 39
2 49
3 187
4 159
5 158
...
16947 14
16948 7
16976 2
16977 1
16978 1
16980 1
Name: article_id, dtype: int64
I got this series from a dataframe with the following command:
logs.loc[logs['article_id'] <= 17029].groupby('article_id')['article_id'].count()
logs is the dataframe here and article_id is one of the columns in it.
How do I plot a bar chart (using Matplotlib) such that the article_id is on the X-axis and the frequency count is on the Y-axis?
My natural instinct was to convert it into a list using .tolist() but that doesn't preserve the article_id.
IIUC you need Series.plot.bar:
#pandas 0.17.0 and above
s.plot.bar()
#pandas below 0.17.0
s.plot('bar')
Sample:
import pandas as pd
import matplotlib.pyplot as plt
s = pd.Series({16976: 2, 1: 39, 2: 49, 3: 187, 4: 159,
5: 158, 16947: 14, 16977: 1, 16948: 7, 16978: 1, 16980: 1},
name='article_id')
print (s)
1 39
2 49
3 187
4 159
5 158
16947 14
16948 7
16976 2
16977 1
16978 1
16980 1
Name: article_id, dtype: int64
s.plot.bar()
plt.show()
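Regarding the .tolist() attempt in the question: the labels are not lost if you take them from s.index and the counts from s.values. Below is a minimal sketch that calls Matplotlib directly. The contents of logs are made up for illustration; only the column name and the filter come from the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the logs dataframe from the question
logs = pd.DataFrame({"article_id": [1, 1, 2, 3, 3, 3, 16980, 20000]})

# Same filter and count as in the question
s = logs.loc[logs["article_id"] <= 17029].groupby("article_id")["article_id"].count()

fig, ax = plt.subplots()
# String labels give evenly spaced bars instead of bars placed at x = 1 ... 16980
ax.bar(s.index.astype(str), s.values)
ax.set_xlabel("article_id")
ax.set_ylabel("frequency count")
fig.savefig("article_counts.png")
```

With many thousands of distinct ids the tick labels will overlap, so you may want to plot only a slice such as s.head(50).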
The new pandas API suggests the following way:
import pandas as pd
s = pd.Series({16976: 2, 1: 39, 2: 49, 3: 187, 4: 159,
5: 158, 16947: 14, 16977: 1, 16948: 7, 16978: 1, 16980: 1},
name='article_id')
s.plot(kind="bar", figsize=(20,10))
If you are working in Jupyter, you don't need to import matplotlib yourself; pandas uses it under the hood, so it still has to be installed.
Just pass 'bar' as the kind parameter of plot.
Example
from pandas import read_csv
# parser is a user-defined date-parsing function (not shown here)
series = read_csv('BwsCount.csv', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser)
series.plot(kind='bar')
The default value of kind is 'line' (i.e. series.plot() draws a line plot).
For your reference:
kind : str
‘line’ : line plot (default)
‘bar’ : vertical bar plot
‘barh’ : horizontal bar plot
‘hist’ : histogram
‘box’ : boxplot
‘kde’ : Kernel Density Estimation plot
‘density’ : same as ‘kde’
‘area’ : area plot
‘pie’ : pie plot
Issue
I have cumulative totals in row 751 of my dataframe.
I want to create a pie chart showing the numbers and percentages for just row 751.
This is my code:
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib notebook
data = pd.read_csv('cleaned_df.csv')
My .csv has the following columns:
A,B,C,D,E,F
The rows (numbers) under each column (letter) are:
A= 123456
B= 234567
C= 345678
D= 456789
E= 56789
F= 123454
Let's say I want to create a pie chart with only columns B, D and F, using the last row of numbers, which would be row 6 (678994).
How do I go about that?
A possible solution is the following:
import matplotlib.pyplot as plt
import pandas as pd
# set test data and create dataframe
data = {"Date": ["01/01/2022", "01/02/2022", "01/03/2022", "01/04/2022", ], "Male": [1, 2, 3, 6], "Female": [2, 2, 3, 7], "Unknown": [3, 2, 4, 9]}
df = pd.DataFrame(data)
Returns (where 3 is the target row for chart)
# set target row index, use 751 in your case
target_row_index = 3
# make the pie circular by setting the aspect ratio to 1
plt.figure(figsize=plt.figaspect(1))
# specify data for chart
values = df.iloc[target_row_index, 1:]
labels = df.columns[1:]
# define function to format values on chart
def make_autopct(values):
    def my_autopct(pct):
        total = sum(values)
        val = int(round(pct*total/100.0))
        return '{p:.2f}% ({v:d})'.format(p=pct,v=val)
    return my_autopct
plt.pie(values, labels=labels, autopct=make_autopct(values))
plt.show()
Shows
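To restrict the pie to a few columns, as the question asks for B, D and F, select them by label before plotting. This is only a sketch: the two-row frame is made up (only the column names and the last-row values are taken from the question), and fmt is a hypothetical helper name.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data; in the question this would come from pd.read_csv('cleaned_df.csv')
df = pd.DataFrame({"A": [1, 6], "B": [2, 7], "C": [3, 8],
                   "D": [4, 9], "E": [5, 9], "F": [1, 4]})

cols = ["B", "D", "F"]               # columns to show
values = df.loc[df.index[-1], cols]  # last row, chosen columns only
total = values.sum()

def fmt(pct):
    # pct is the wedge's share in percent; convert it back to the absolute number
    return "{:.1f}% ({:d})".format(pct, int(round(pct * total / 100.0)))

fig, ax = plt.subplots(figsize=plt.figaspect(1))
wedges, texts, autotexts = ax.pie(values.to_numpy(dtype=float), labels=cols, autopct=fmt)
fig.savefig("pie_selected_columns.png")
```

For row 751 of the real data, replace df.index[-1] with the label of that row.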
I have following dataset:
import pandas as pd
import matplotlib.pyplot as plt
dict = {'time':["2017-01-02", "2017-01-03", "2017-01-04", "2017-01-05", "2017-01-06"],'val':[3.2, 10.2, 11.3, 4.9, 2.3],
'class': [0, 1, 1, 0,0]}
df = pd.DataFrame(dict)
df
time val class
0 2017-01-02 3.2 0
1 2017-01-03 10.2 1
2 2017-01-04 11.3 1
3 2017-01-05 4.9 0
4 2017-01-06 2.3 0
I want to plot a line for column "val" with df.time on the x axis, changing the color of the line based on the "class" column (for example, blue where class is 0 and red where it is 1). My plot currently looks like this:
but the desired result is something like this:
Thanks!
Like in this question, you will just need to plot a bunch of lines:
# recommended: convert the time column to datetime
df['time'] = pd.to_datetime(df['time'])
plt.figure(figsize=(10,6))
for i in range(1,len(df)):
    s = df.iloc[i-1:i+1]
    color = 'r' if s['class'].eq(1).all() else 'C0'
    plt.plot(s['time'], s['val'], c=color)
plt.show()
Output:
For when you have a lot of rows, it might be better to use scatter:
import numpy as np
plt.scatter(df['time'], df['val'],
            color=np.where(df['class'], 'r','C0')
           )
Output (will look better with 10k rows):
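If you want connected lines even with many rows, one alternative to calling plt.plot once per segment is Matplotlib's LineCollection, which draws all segments in a single artist. This is a different technique from the loop above, sketched with the same coloring rule (red only where both endpoints have class 1); the variable names are my own.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display; not needed in a notebook
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.collections import LineCollection

df = pd.DataFrame({
    "time": ["2017-01-02", "2017-01-03", "2017-01-04", "2017-01-05", "2017-01-06"],
    "val": [3.2, 10.2, 11.3, 4.9, 2.3],
    "class": [0, 1, 1, 0, 0],
})
df["time"] = pd.to_datetime(df["time"])

# LineCollection needs numeric coordinates, so convert the dates to Matplotlib date numbers
x = mdates.date2num(df["time"].to_numpy())
y = df["val"].to_numpy()
points = np.column_stack([x, y])
# One segment per pair of consecutive points, shape (n - 1, 2, 2)
segments = np.stack([points[:-1], points[1:]], axis=1)

# Red only when both endpoints of a segment have class 1
cls = df["class"].to_numpy()
both_one = (cls[:-1] == 1) & (cls[1:] == 1)
colors = np.where(both_one, "r", "C0").tolist()

fig, ax = plt.subplots(figsize=(10, 6))
ax.add_collection(LineCollection(segments, colors=colors))
ax.xaxis_date()  # show the x axis as dates again
ax.autoscale()   # collections do not rescale the axes on their own
fig.savefig("colored_line.png")
```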
I am a bit confused about what sort of package to use in order to plot my data which typically consists of 10 different categories (e.g. Temperatures) with 3 or 4 parallel measurements each. Here I have tried just using pandas (Trial1+2) and seaborn (Trial3).
In the end, what I would like to have is a scatterplot showing the three measurements from each category, and additionally drawing an average line through all my data (see example A and B below in figure).
I know that I can place my data in a CSV file and import it with the pandas package in a Jupyter notebook. That is where my problem starts, which I now think might be related to indexing or data types. I get a lot of errors saying that x must equal y, or that the index 'Degrees' is not defined... I will show the most successful trials below.
I have tried several things so far using this made-up dataset 'Dummydata', which is very representative of the kind of thing I will do with my real data.
My test CSV File:
It's a .csv file with four columns: the first is the temperature, and the next three are the first, second and third measurements at that temperature (y1, y2, y3).
in[]: Dummydata.to_dict()
Out[]:
{'Degrees': {0: 0,
1: 10,
2: 20,
3: 30,
4: 40,
5: 50,
6: 60,
7: 70,
8: 80,
9: 90},
'y1': {0: 20, 1: 25, 2: 34, 3: 35, 4: 45, 5: 70, 6: 46, 7: 20, 8: 10, 9: 15},
'y2': {0: 20, 1: 24, 2: 32, 3: 36, 4: 41, 5: 77, 6: 48, 7: 23, 8: 19, 9: 16},
'y3': {0: 18, 1: 26, 2: 36, 3: 37, 4: 42, 5: 75, 6: 46, 7: 21, 8: 15, 9: 16}}
Trial 1: trying to achieve a scatterplot
import pandas as pd
import matplotlib.pyplot as plt
Dummydata = pd.read_csv('DummyData.csv','r',delimiter=(';'), header=0)
y = ['y1','y2','y3']
x = ['Degrees']
Dummydata.plot(x,y)
This will give a nice line plot but also produce the UserWarning: Pandas doesn't allow columns to be created via a new attribute name (??).
If I change the plot to Dummydata.plot.scatter(x,y) then I get the error: x and y must be the same size... So I know that the shape of my data is (10,4) because of 10 rows and 4 column, how can I redefine this to be okay for pandas?
Trial 2: same thing small adjustments
import pandas as pd
import matplotlib.pyplot as plt
#import the .csv file, set the delimiter to ; and use the first line (0) as the header
Dummydata = pd.read_csv('DummyData.csv','r',delimiter=(';'), header = 0)
x =('Degrees')
y1 =('y1')
y2 =('y2')
y3 =('y3')
Dummydata.plot([x,y3]) #works fine for one value, but prints y1 and y2 ?? why?
Dummydata.plot([x,y1]) # also works, but prints out y2 and y3 ??? why?
Dummydata.plot([x,y]) # get error all arrays must be same length?
Dummydata.plot.scatter([x,y]) # many error, no plot
Somehow I must tell pandas that the data shape (10,4) is okay? Not sure what I'm doing wrong here.
Trial 3: using seaborn and try to get a scatterplot
I simply started by making a factorplot, where I again ran into the same problem of getting more than one y value onto my graph. I don't think converting this to a scatter plot would be hard if I just knew how to add more data to one graph.
import seaborn as sns
import matplotlib.pyplot as plt
#import the .csv file using pandas
Dummydata = pd.read_csv('DummyData.csv', 'r', delimiter=(';'))
#Checking what the file looks like
#Dummydata.head(2)
x =('Degrees')
y1 =('y1')
y2 =('y2')
y3 =('y3')
y =(['y1','y2','y3'])
Factorplot =sns.factorplot(x='Degrees',y='y1',data=Dummydata)
The factorplot works fine for one dataset. However, when I try to add more y values (either by defining y =(['y1','y2','y3']) beforehand or inside the plotting call), I get errors like: Could not interpret input 'y'. For instance, for this input:
Factorplot =sns.factorplot(x='Degrees',y='y',data=Dummydata)
or
Factorplot =sns.factorplot(x='Degrees',y=(['y1','y2','y3']),data=Dummydata)
#Error: cannot copy sequence with size 3 to array axis with dimension 10
What I would like to achieve is something like this: in (A) I would like a scatterplot with a rolling mean, and in (B) I would like to plot only the average of each category together with its standard deviation, and additionally draw a rolling mean across the categories, as follows:
I don't want to type my data values in manually; I want to import them from a .csv file (because the datasets can become very big).
Is there something wrong with the way I am organising my csv file?
All help appreciated.
Compute rolling statistics with rolling. Compute mean and standard deviation with mean and std. Plot data with plot. Add y-error bars with the yerr keyword argument.
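import matplotlib.pyplot as plt
# data is the DataFrame read from the CSV (Dummydata in the question)
# note: with window=6 the first five rows of the rolling mean are NaN
data = data.set_index('Degrees').rolling(window=6).mean()
mean = data.mean(axis='columns')
std = data.std(axis='columns')
ax = mean.plot()
data.plot(style='o', ax=ax)
plt.figure()
mean.plot(yerr=std, capsize=3)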
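For a version you can run without the CSV file, here is a sketch built from the Dummydata values posted in the question. It differs from the snippet above in a few ways that are my own assumptions, not part of the answer: the per-temperature mean and standard deviation are taken from the raw measurements, the rolling mean is applied afterwards with a centered window of 3 and min_periods=1 (so the ends are not NaN), and panel (B) calls Matplotlib's errorbar directly.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Values copied from Dummydata.to_dict() in the question
df = pd.DataFrame({
    "Degrees": [0, 10, 20, 30, 40, 50, 60, 70, 80, 90],
    "y1": [20, 25, 34, 35, 45, 70, 46, 20, 10, 15],
    "y2": [20, 24, 32, 36, 41, 77, 48, 23, 19, 16],
    "y3": [18, 26, 36, 37, 42, 75, 46, 21, 15, 16],
}).set_index("Degrees")

row_mean = df.mean(axis="columns")  # average of the three measurements per temperature
row_std = df.std(axis="columns")    # sample standard deviation per temperature
# Window size and centering are assumptions; tune them for real data
rolling = row_mean.rolling(window=3, center=True, min_periods=1).mean()

fig, (ax_a, ax_b) = plt.subplots(1, 2, figsize=(12, 4))

# (A) every measurement as a marker, plus the rolling mean as a line
df.plot(style="o", ax=ax_a)
ax_a.plot(rolling.index, rolling.to_numpy(), color="k", label="rolling mean")
ax_a.legend()

# (B) only the averages with error bars, plus the same rolling mean
ax_b.errorbar(row_mean.index, row_mean.to_numpy(), yerr=row_std.to_numpy(), fmt="o", capsize=3)
ax_b.plot(rolling.index, rolling.to_numpy(), color="k")
ax_b.set_xlabel("Degrees")

fig.savefig("measurements.png")
```

The wide layout of the CSV (one column per repeat measurement) is fine for this; the frame only has to be indexed by Degrees so that all three y columns share the same x values.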
I want to reshape my data vector, but when I run this code:
from pandas import read_csv
import numpy as np
#from pandas import Series
#from matplotlib import pyplot
series =read_csv('book1.csv', header=0, parse_dates=[0], index_col=0, squeeze=True)
A= np.array(series)
B = np.reshape(10,10)
print (B)
I get this error:
result = getattr(asarray(obj), method)(*args, **kwds)
ValueError: total size of new array must be unchanged
my data
Month xxx
1749-01 58
1749-02 62.6
1749-03 70
1749-04 55.7
1749-05 85
1749-06 83.5
1749-07 94.8
1749-08 66.3
1749-09 75.9
1749-10 75.5
1749-11 158.6
1749-12 85.2
1750-01 73.3
.... ....
.... ....
There seem to be two issues with what you are trying to do. The first relates to how you read the data in pandas:
series = read_csv('book1.csv', header=0, parse_dates=[0], index_col=0, squeeze=True)
print(series)
>>>>Empty DataFrame
Columns: []
Index: [1749-01 58, 1749-02 62.6, 1749-03 70, 1749-04 55.7, 1749-05 85, 1749-06 83.5, 1749-07 94.8, 1749-08 66.3, 1749-09 75.9, 1749-10 75.5, 1749-11 158.6, 1749-12 85.2, 1750-01 73.3]
This isn't giving you a column of floats in a dataframe with the dates as the index; it is putting each whole line, date and value, into the index. I would think that you want to add delimiter=' ' so that it splits the lines properly:
series =read_csv('book1.csv', header=0, parse_dates=[0], index_col=0, delimiter=' ', squeeze=True)
>>>> Month
1749-01-01 58.0
1749-02-01 62.6
1749-03-01 70.0
1749-04-01 55.7
1749-05-01 85.0
1749-06-01 83.5
1749-07-01 94.8
1749-08-01 66.3
1749-09-01 75.9
1749-10-01 75.5
1749-11-01 158.6
1749-12-01 85.2
1750-01-01 73.3
Name: xxx, dtype: float64
This gives you the dates as the index with the 'xxx' value in the column.
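Note that newer pandas releases no longer accept squeeze=True in read_csv. A sketch of the same read that avoids it, using an in-memory copy of the first few rows from the question instead of book1.csv; the whitespace separator and the explicit date format are assumptions about the file:

```python
import io

import pandas as pd

# First rows of the data shown in the question, standing in for book1.csv
raw = "Month xxx\n1749-01 58\n1749-02 62.6\n1749-03 70\n"

# sep=r"\s+" splits on any run of whitespace
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=0, index_col=0)
# Parse the index explicitly so the year-month format is unambiguous
df.index = pd.to_datetime(df.index, format="%Y-%m")

# A one-column DataFrame becomes a Series, replacing squeeze=True
series = df.squeeze("columns")
print(series)
```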
Secondly the reshape. The error is quite descriptive in this case. If you want to use numpy.reshape you can't reshape to a layout that has a different number of elements to the original data. For example:
import numpy as np
a = np.array([1, 2, 3, 4, 5, 6]) # size 6 array
a.reshape(2, 3)
>>>> [[1, 2, 3],
[4, 5, 6]]
This is fine because the array starts out length 6, and I'm reshaping to 2 x 3, and 2 x 3 = 6.
However, if I try:
a.reshape(10, 10)
>>>> ValueError: cannot reshape array of size 6 into shape (10,10)
I get the error, because I need 10 x 10 = 100 elements to do this reshape, and I only have 6.
Without the complete dataset it's impossible to know for sure, but I think this is the same problem you are having, although you are converting your whole dataframe to a numpy array.
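If the goal is a table with a fixed number of columns, for example one row per year of monthly values, you can let numpy work out the number of rows with -1 and drop any incomplete last row first. A minimal sketch; the array is a stand-in for np.array(series), and 12 columns is an assumption about the layout you want:

```python
import numpy as np

# Stand-in for np.array(series): 30 monthly values
values = np.arange(30, dtype=float)

n_cols = 12
# Keep only as many elements as fill complete rows
usable = (values.size // n_cols) * n_cols
table = values[:usable].reshape(-1, n_cols)  # -1 lets numpy infer the row count

print(table.shape)
```

Note also that reshape is called with the array as well as the target shape, either as values.reshape(rows, cols) or as np.reshape(values, (rows, cols)).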
I'm trying to produce a plot from a dataframe in an IPython Notebook, but the command doesn't finish executing. The dataframe (StatePremiums) looks like this:
index StateCode PremiumAdultIndividualAge30 YearlyAverage
0 0 AK 633 7596
1 1 AK 755 9060
2 2 AK 916 10992
3 3 AK 803 9636
4 4 AK 785 9420
When I try to plot using the following line, the kernel doesn't execute, it just keeps running without end. This isn't a display/show issue.
StatePremiumAverages.plot(kind="barh",x=StatePremiumAverages["StateCode"],
title="Average Yearly Health Premiums for Individuals, Age 30", legend=False)
What could be the issue?
Use %matplotlib inline as the first line of your notebook. The following works for me:
%matplotlib inline
import pandas as pd
StatePremiumAverages = pd.DataFrame({
'index': [0, 1, 2, 3, 4],
'StateCode': ['AK', 'AK', 'AK', 'AK', 'AK'],
'PremiumAdultIndividualAge30' : [633, 755, 916, 803, 785],
'YearlyAverage' : [7596, 9060, 10992, 9636, 9420]
})
StatePremiumAverages.plot(kind="barh",x=StatePremiumAverages["StateCode"],
title="Average Yearly Health Premiums for Individuals, Age 30", legend=False)
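The pandas documentation describes x as a column label or position, so passing the column name instead of the Series itself is the safer form and may be worth trying if the call above misbehaves on your version. A sketch under that assumption; choosing YearlyAverage for y is my guess at what should be plotted:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display; use %matplotlib inline in a notebook
import pandas as pd

StatePremiumAverages = pd.DataFrame({
    'index': [0, 1, 2, 3, 4],
    'StateCode': ['AK', 'AK', 'AK', 'AK', 'AK'],
    'PremiumAdultIndividualAge30': [633, 755, 916, 803, 785],
    'YearlyAverage': [7596, 9060, 10992, 9636, 9420]
})

# Pass column names, not the Series objects themselves
ax = StatePremiumAverages.plot(
    kind="barh", x="StateCode", y="YearlyAverage",
    title="Average Yearly Health Premiums for Individuals, Age 30", legend=False)
ax.figure.savefig("premiums.png")
```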