iPython Notebook Pandas Not Plotting - pandas

I'm trying to produce a plot from a DataFrame in IPython Notebook, but the command never finishes executing. The DataFrame (StatePremiums) looks like this:
index StateCode PremiumAdultIndividualAge30 YearlyAverage
0 0 AK 633 7596
1 1 AK 755 9060
2 2 AK 916 10992
3 3 AK 803 9636
4 4 AK 785 9420
When I try to plot using the following line, the cell never finishes; the kernel just keeps running without end. This isn't a display/show issue.
StatePremiumAverages.plot(kind="barh",x=StatePremiumAverages["StateCode"],
title="Average Yearly Health Premiums for Individuals, Age 30", legend=False)
What could be the issue?

Use %matplotlib inline as the first line of your notebook. The following works for me:
%matplotlib inline
import pandas as pd
StatePremiumAverages = pd.DataFrame({
'index': [0, 1, 2, 3, 4],
'StateCode': ['AK', 'AK', 'AK', 'AK', 'AK'],
'PremiumAdultIndividualAge30' : [633, 755, 916, 803, 785],
'YearlyAverage' : [7596, 9060, 10992, 9636, 9420]
})
# pass the column label for x rather than the column itself
StatePremiumAverages.plot(kind="barh", x="StateCode",
                          title="Average Yearly Health Premiums for Individuals, Age 30", legend=False)
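If the goal is one bar per state, a minimal sketch follows. It assumes (this is not stated in the question) that the frame should first be reduced to the mean YearlyAverage per StateCode; the second state 'AL' and its numbers are made up purely for illustration. The x and y parameters of DataFrame.plot are given as column labels here.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook with %matplotlib inline
import pandas as pd

# Hypothetical data: the 'AL' rows are invented so that there is more than one bar.
premiums = pd.DataFrame({
    "StateCode": ["AK", "AK", "AK", "AL", "AL"],
    "YearlyAverage": [7596, 9060, 10992, 6000, 6600],
})

# One row per state, holding the mean yearly premium.
state_means = premiums.groupby("StateCode", as_index=False)["YearlyAverage"].mean()

# x and y are column labels, not Series objects.
ax = state_means.plot(
    kind="barh",
    x="StateCode",
    y="YearlyAverage",
    title="Average Yearly Health Premiums for Individuals, Age 30",
    legend=False,
)
```

Aggregating first also keeps the number of bars small, which matters if the original frame has many rows per state.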

Related

matplotlib. Change color of line based on values in other column keeping x axis same

I have following dataset:
import pandas as pd
import matplotlib.pyplot as plt
dict = {'time':["2017-01-02", "2017-01-03", "2017-01-04", "2017-01-05", "2017-01-06"],'val':[3.2, 10.2, 11.3, 4.9, 2.3],
'class': [0, 1, 1, 0,0]}
df = pd.DataFrame(dict)
df
time val class
0 2017-01-02 3.2 0
1 2017-01-03 10.2 1
2 2017-01-04 11.3 1
3 2017-01-05 4.9 0
4 2017-01-06 2.3 0
I want to plot a line for the column "val", keeping df.time on the x axis, while changing the color of the line based on the 'class' column (for example, blue where it is 0 and red where it is 1). My current plot looks like this:
but the desired result is something like this:
Thanks!
Like in this question, you will just need to plot a bunch of lines:
# recommend
df['time'] = pd.to_datetime(df['time'])
plt.figure(figsize=(10,6))
for i in range(1,len(df)):
    s = df.iloc[i-1:i+1]
    color = 'r' if s['class'].eq(1).all() else 'C0'
    plt.plot(s['time'], s['val'], c=color)
plt.show()
Output:
For when you have a lot of rows, it might be better to use scatter:
import numpy as np
plt.scatter(df['time'], df['val'],
            color=np.where(df['class'], 'r','C0')
)
Output (will look better with 10k rows):
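For many rows, calling plt.plot once per segment can get slow. A possible alternative, not used in the answer above, is matplotlib's LineCollection, which draws all segments in one artist. The sketch below rebuilds the question's data and applies the same rule as the loop (a segment is red only when both of its endpoints have class 1); treat it as an illustration under that assumption.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.collections import LineCollection

df = pd.DataFrame({
    "time": pd.to_datetime(["2017-01-02", "2017-01-03", "2017-01-04",
                            "2017-01-05", "2017-01-06"]),
    "val": [3.2, 10.2, 11.3, 4.9, 2.3],
    "class": [0, 1, 1, 0, 0],
})

# Convert dates to matplotlib's float representation so they can sit in one array.
x = mdates.date2num(df["time"].to_numpy())
y = df["val"].to_numpy()
points = np.column_stack([x, y])

# Segment i joins point i and point i + 1: shape (n - 1, 2, 2).
segments = np.stack([points[:-1], points[1:]], axis=1)

# Red only when both endpoints belong to class 1, otherwise the default blue.
cls = df["class"].to_numpy()
both_one = (cls[:-1] == 1) & (cls[1:] == 1)
colors = ["r" if flag else "C0" for flag in both_one]

fig, ax = plt.subplots(figsize=(10, 6))
ax.add_collection(LineCollection(segments, colors=colors))
ax.xaxis_date()       # show the float x values as dates again
ax.autoscale_view()   # collections do not rescale the axes on their own
```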

How to make a plot from method read_html of Pandas on Python 2.7?

I'm trying to make a plot (of any kind) but cannot find the .plot() method, and I'm also getting this traceback (the data shown first is the printed df):
[ 2019 I II III IV
Total
3373 Barrio1 1175 1117 1081 Â
8079 Barrio2 2651 2570 2858 Â
3839 Barrio232 1364 1237 1238 Â
1762 Barrio2342342 544 547 671 Â
3946 Barrio224235 1257 1291 1398 Â
Traceback (most recent call last):
File "D:/Users/str_leu/Documents/PycharmProjects/flask/graphs.py", line 13, in <module>
plt.scatter(df['barrios'], df['leuros'])
TypeError: list indices must be integers, not str
Process finished with exit code 1
and the code is:
import pandas
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
table = BeautifulSoup(open('./PycharmProjects/flask/tables.html', 'r').read(), features="lxml").find('table')
df = pandas.read_html(str(table), decimal=',', thousands='.', index_col=0)
print df
plt.scatter(df['barrios'], df['euros'])
plt.show()
UPDATED
df = pandas.read_html(str(table), decimal=',', thousands='.', index_col=2, header=1)
In the end I found how to deal with it, but the problem now is the last column (the strange character). Does anyone know how to skip it?
UPDATED2
[ District2352 1.175 1.117 1.081 Unnamed: 5
3.373
8079 District23422 2651 2570 2858 NaN
3839 District7678 1364 1237 1238 NaN
1762 Distric3 544 547 671 NaN
3946 dISTRICT1 1257 1291 1398 NaN
I need to drop the entire last column, but I don't know how to get from the result of pandas' read_html method to a DataFrame and then draw a plot...
UPDATED 3
2019 I II III IV
Total
3373 dISTRICT1 1175 1117 1081 NaN
8079 District2 2651 2570 2858 NaN
This is an example with the headers
pandas.read_html returns a list of DataFrames. Currently you're trying to index that list with a str, which is what causes the error. Depending on your requirements, you can either plot columns from each DataFrame using a for loop, or combine the DataFrames in some way using pd.concat:
import seaborn as sns
# If each dataframe holds the same columns you want to plot
dfs = pandas.read_html(str(table), decimal=',', thousands='.', index_col=0)
for df in dfs:
    # you would need to individually define the plot you want
    df["2019"].value_counts().plot(kind='bar')
    df.plot(x='I', y='II') # etc
    # you could also try seaborn's pairplot. This will omit categorical data
    sns.pairplot(df)
SOLUTION
dfs = pandas.read_html(str(table), decimal=',', thousands='.', header=1, index_col=1, encoding='utf-8').pop(0)
print dfs
x=[]
y=[]
y1=[]
y2=[]
for i, row in dfs.iterrows():
    x.append(row[0])
    y.append(int(row[1]))
    y1.append(int(row[2]))
    y2.append(int(row[3]))
plt.plot(x,y)
plt.plot(x,y1)
plt.plot(x,y2)
plt.show()
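The row-by-row loop above can usually be avoided. The sketch below takes the first DataFrame from the list that read_html returns, drops any column that is entirely NaN (such as the trailing empty column in the question), and plots the remaining columns directly. The inline HTML table and the column name 'Barrio' are hypothetical stand-ins for tables.html, and the syntax is Python 3 with a recent pandas rather than the Python 2.7 used in the question.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import pandas as pd

# Hypothetical stand-in for the real table; the last column is empty on purpose.
html = """
<table>
  <tr><th>Total</th><th>Barrio</th><th>I</th><th>II</th><th>III</th><th>IV</th></tr>
  <tr><td>3373</td><td>Barrio1</td><td>1175</td><td>1117</td><td>1081</td><td></td></tr>
  <tr><td>8079</td><td>Barrio2</td><td>2651</td><td>2570</td><td>2858</td><td></td></tr>
</table>
"""

tables = pd.read_html(io.StringIO(html))  # always a list, even for a single table
df = tables[0]

# Drop columns in which every value is missing.
df = df.dropna(axis=1, how="all")

# One line per quarter, with the district names along the x axis.
ax = df.plot(x="Barrio", y=["I", "II", "III"])
```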

Pandas replace not working even with inplace=True

I am using the replace function, but it does not work: the replacement is not applied and I still see the original string. In the pandas documentation, the replace function does not even seem to have an inplace argument, so I wonder whether inplace actually works?
df["Name"].replace(["Bill"], "William", inplace=True)
I still see: Bill
Try the following, passing the replacement as a dictionary:
import pandas as pd
df = pd.DataFrame({'Name': ['Bill','James','Joe','John','Bill'], 'Age': [34, 21, 34, 45, 23]})
df.replace({'Bill': 'William'}, inplace=True)
#OR
df['Name'].replace({'Bill': 'William'}, inplace=True)
Indeed, this produces:
Name Age
0 William 34
1 James 21
2 Joe 34
3 John 45
4 William 23
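If the in-place call on a selected column still appears to do nothing, assigning the result back is a more explicit variant that does not depend on whether df["Name"] is a view or a copy. The sketch below also shows one common cause of "nothing was replaced": Series.replace matches whole cell values, so a substring needs .str.replace instead. The full_names series is a made-up example, not data from the question.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Bill", "James", "Joe", "John", "Bill"],
                   "Age": [34, 21, 34, 45, 23]})

# Assign the result back instead of relying on inplace=True.
df["Name"] = df["Name"].replace({"Bill": "William"})

# Hypothetical data where "Bill" is only part of each cell.
full_names = pd.Series(["Bill Smith", "Joe Bloggs"])

# replace compares entire values, so nothing matches here.
unchanged = full_names.replace("Bill", "William")

# .str.replace works on substrings.
changed = full_names.str.replace("Bill", "William", regex=False)
```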

ValueError: total size of new array must be unchanged (numpy for reshape)

I want to reshape my data vector, but when I run the code
from pandas import read_csv
import numpy as np
#from pandas import Series
#from matplotlib import pyplot
series =read_csv('book1.csv', header=0, parse_dates=[0], index_col=0, squeeze=True)
A= np.array(series)
B = np.reshape(10,10)
print (B)
I get this error:
result = getattr(asarray(obj), method)(*args, **kwds)
ValueError: total size of new array must be unchanged
My data:
Month xxx
1749-01 58
1749-02 62.6
1749-03 70
1749-04 55.7
1749-05 85
1749-06 83.5
1749-07 94.8
1749-08 66.3
1749-09 75.9
1749-10 75.5
1749-11 158.6
1749-12 85.2
1750-01 73.3
.... ....
.... ....
There seem to be two issues with what you are trying to do. The first relates to how you read the data in pandas:
series = read_csv('book1.csv', header=0, parse_dates=[0], index_col=0, squeeze=True)
print(series)
>>>>Empty DataFrame
Columns: []
Index: [1749-01 58, 1749-02 62.6, 1749-03 70, 1749-04 55.7, 1749-05 85, 1749-06 83.5, 1749-07 94.8, 1749-08 66.3, 1749-09 75.9, 1749-10 75.5, 1749-11 158.6, 1749-12 85.2, 1750-01 73.3]
This isn't giving you a column of floats in a DataFrame with the dates as the index; it is putting each whole line, date and value together, into the index. I would think that you want to add delimiter=' ' so that it splits the lines properly:
series =read_csv('book1.csv', header=0, parse_dates=[0], index_col=0, delimiter=' ', squeeze=True)
>>>> Month
1749-01-01 58.0
1749-02-01 62.6
1749-03-01 70.0
1749-04-01 55.7
1749-05-01 85.0
1749-06-01 83.5
1749-07-01 94.8
1749-08-01 66.3
1749-09-01 75.9
1749-10-01 75.5
1749-11-01 158.6
1749-12-01 85.2
1750-01-01 73.3
Name: xxx, dtype: float64
This gives you the dates as the index with the 'xxx' value in the column.
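In recent pandas releases read_csv no longer accepts the squeeze keyword, so the call above may raise a TypeError there. A minimal sketch of the equivalent step, assuming a space-separated file like the sample in the question (read here from an in-memory string instead of book1.csv), is to read a one-column DataFrame and squeeze it afterwards:

```python
import io

import pandas as pd

# First rows of the sample data, kept in memory so the sketch is self-contained.
csv_text = """Month xxx
1749-01 58
1749-02 62.6
1749-03 70
"""

frame = pd.read_csv(io.StringIO(csv_text), sep=" ", header=0,
                    index_col=0, parse_dates=[0])

# Turn the single-column DataFrame into a Series.
series = frame.squeeze("columns")
```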
Second, the reshape. The error is quite descriptive in this case. If you want to use numpy.reshape, you can't reshape to a layout that has a different number of elements from the original data. For example:
import numpy as np
a = np.array([1, 2, 3, 4, 5, 6]) # size 6 array
a.reshape(2, 3)
>>>> [[1, 2, 3],
[4, 5, 6]]
This is fine because the array starts out length 6, and I'm reshaping to 2 x 3, and 2 x 3 = 6.
However, if I try:
a.reshape(10, 10)
>>>> ValueError: cannot reshape array of size 6 into shape (10,10)
I get the error, because I need 10 x 10 = 100 elements to do this reshape, and I only have 6.
Without the complete dataset it's impossible to know for sure, but I think this is the same problem you are having, although you are converting your whole dataframe to a numpy array.
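Note also that np.reshape(10, 10), as written in the question, passes 10 as the array to reshape and never touches A; the function form is np.reshape(array, new_shape). The sketch below uses the 13 values shown in the question and an assumed target of 12 columns (one row per year). That layout is a guess for illustration, not something the question states. It trims the data to a whole number of rows and lets NumPy infer the row count with -1.

```python
import numpy as np

# The 13 sample values listed in the question.
values = np.array([58, 62.6, 70, 55.7, 85, 83.5, 94.8,
                   66.3, 75.9, 75.5, 158.6, 85.2, 73.3])

cols = 12  # assumed layout: one row per year, one column per month

# Keep only as many elements as fill complete rows.
usable = (values.size // cols) * cols

# -1 asks NumPy to work out the number of rows from the remaining size.
table = values[:usable].reshape(-1, cols)

print(values.size, table.shape)
```

Checking values.size before choosing a shape is usually the quickest way to see why a reshape fails.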

How to plot a bar graph from a pandas series?

Consider my series as below: First column is article_id and the second column is frequency count.
article_id
1 39
2 49
3 187
4 159
5 158
...
16947 14
16948 7
16976 2
16977 1
16978 1
16980 1
Name: article_id, dtype: int64
I got this series from a dataframe with the following command:
logs.loc[logs['article_id'] <= 17029].groupby('article_id')['article_id'].count()
logs is the dataframe here and article_id is one of the columns in it.
How do I plot a bar chart (using Matplotlib) such that the article_id is on the X-axis and the frequency count is on the Y-axis?
My natural instinct was to convert it into a list using .tolist() but that doesn't preserve the article_id.
IIUC you need Series.plot.bar:
#pandas 0.17.0 and above
s.plot.bar()
#pandas below 0.17.0
s.plot('bar')
Sample:
import pandas as pd
import matplotlib.pyplot as plt
s = pd.Series({16976: 2, 1: 39, 2: 49, 3: 187, 4: 159,
5: 158, 16947: 14, 16977: 1, 16948: 7, 16978: 1, 16980: 1},
name='article_id')
print (s)
1 39
2 49
3 187
4 159
5 158
16947 14
16948 7
16976 2
16977 1
16978 1
16980 1
Name: article_id, dtype: int64
s.plot.bar()
plt.show()
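The same call works directly on the grouped counts from the question, because the article_id values live in the index of the resulting Series and pandas uses the index for the x axis. A small sketch, with a made-up logs frame standing in for the real one:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import pandas as pd

# Hypothetical log data; the id 20000 is filtered out by the <= 17029 condition.
logs = pd.DataFrame({"article_id": [1, 1, 2, 3, 3, 3, 20000]})

counts = (logs.loc[logs["article_id"] <= 17029]
              .groupby("article_id")["article_id"]
              .count())

# The index (article_id) becomes the x axis, the counts become the bar heights.
ax = counts.plot(kind="bar")
ax.set_xlabel("article_id")
ax.set_ylabel("frequency")
```

With thousands of distinct ids the bars become unreadable, so it may help to plot only a subset, for example counts.nlargest(20).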
The new pandas API suggests the following way:
import pandas as pd
s = pd.Series({16976: 2, 1: 39, 2: 49, 3: 187, 4: 159,
5: 158, 16947: 14, 16977: 1, 16948: 7, 16978: 1, 16980: 1},
name='article_id')
s.plot(kind="bar", figsize=(20,10))
If you are working in Jupyter, you don't need to import matplotlib yourself (pandas still uses it under the hood to draw the plot).
Just pass 'bar' as the kind parameter of plot.
Example
series = read_csv('BwsCount.csv', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser)
series.plot(kind='bar')
The default value of kind is 'line' (i.e. series.plot() will automatically plot a line graph).
For your reference:
kind : str
‘line’ : line plot (default)
‘bar’ : vertical bar plot
‘barh’ : horizontal bar plot
‘hist’ : histogram
‘box’ : boxplot
‘kde’ : Kernel Density Estimation plot
‘density’ : same as ‘kde’
‘area’ : area plot
‘pie’ : pie plot