Create a heatmap from pandas dataframe - pandas

I have a pandas dataframe of the form:
colA | colB | counts
car1 plane1 23
car2 plane2 51
car1 plane2 12
car2 plane3 41
I first want to create a pandas dataframe that looks a bit like a matrix (similar to the df in this example), also filling the missing values with 0. So the desired result for the above would be:
plane1 plane2 plane3
car1 23 12 0
car2 0 51 41
And then be able to turn this into a heat map. Is there a pandas command I can use for this?

Use pandas.pivot_table to reshape the data and seaborn.heatmap to draw the heatmap:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
piv = pd.pivot_table(df, index='colA', columns='colB', aggfunc='sum', fill_value=0)
piv.columns = piv.columns.droplevel(0)
sns.heatmap(piv)
plt.show()
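For a self-contained check of the reshaping step, the sample data from the question can be rebuilt by hand. This is only a sketch: passing values='counts' is a variation on the code above (not what the answer wrote) that yields single-level columns, so the droplevel call is not needed. The resulting piv can then be passed to sns.heatmap in the same way.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "colA": ["car1", "car2", "car1", "car2"],
    "colB": ["plane1", "plane2", "plane2", "plane3"],
    "counts": [23, 51, 12, 41],
})

# Naming the value column explicitly keeps the result's columns single-level
piv = df.pivot_table(index="colA", columns="colB", values="counts",
                     aggfunc="sum", fill_value=0)
print(piv)
```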

Related

Heatmap with category values in seaborn

I have the following data frame.
ID Cat V1 V2 V3
1 A 1 1 1
2 B 1 1 1
3 A 1 1 0
4 C 0 0 0
I want to create a plot (similar to a heatmap) that shows if V1 to V3 were observed (1) or not (0).
Furthermore, each field should be colored according to the category of the row.
For example, if Cat is A, it shall be red; if Cat is B, it shall be green; and if Cat is C, it shall be blue.
Hence, in this case, all squares in the first row of the heatmap shall be red, and in the second row, they shall be green.
I want to use seaborn or matplotlib in python to create the plot.
However, I do not know what the type of plot would be.
There are probably less involved ways, but the following approach loops through the categories and, for each one, draws a heatmap filled with the corresponding color.
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns
import pandas as pd
import numpy as np
from io import StringIO
data_str = '''ID Cat V1 V2 V3
1 A 1 1 1
2 B 1 1 1
3 A 1 1 0
4 C 0 0 0'''
# sep=r'\s+' splits on whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_csv(StringIO(data_str), sep=r'\s+')
df = df.set_index('ID')
fig, ax = plt.subplots(figsize=(6, 6))
categories = ['A', 'B', 'C']
colors = ['crimson', 'lime', 'dodgerblue']
for cat, color in zip(categories, colors):
    df_cat = df[['V1', 'V2', 'V3']][df['Cat'] == cat].reindex(df.index, fill_value=0)
    df_cat[['V1', 'V2', 'V3']] = df_cat[['V1', 'V2', 'V3']].replace({0: np.nan})
    if not np.all(df_cat.isna()):
        sns.heatmap(data=df_cat, cmap=ListedColormap([color]), cbar=False, lw=1, alpha=1, ax=ax)
plt.show()
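One possibly less involved variant, offered only as a sketch, is to encode each observed cell with an integer code for its row's category and draw everything in a single imshow call. The names codes and coded, the use of imshow instead of seaborn, and the Agg backend are assumptions made for this illustration, not part of the answer above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Cat": ["A", "B", "A", "C"],
    "V1": [1, 1, 1, 0],
    "V2": [1, 1, 1, 0],
    "V3": [1, 1, 0, 0],
}).set_index("ID")

categories = ["A", "B", "C"]
colors = ["crimson", "lime", "dodgerblue"]

# Integer code per row: A -> 0, B -> 1, C -> 2
codes = df["Cat"].map({cat: i for i, cat in enumerate(categories)})

values = df[["V1", "V2", "V3"]]
# Observed cells get the row's category code, unobserved cells become NaN
coded = pd.DataFrame(
    np.where(values.eq(1), codes.to_numpy()[:, None], np.nan),
    index=values.index, columns=values.columns,
)

fig, ax = plt.subplots(figsize=(6, 6))
# NaN cells are left blank; the vmin/vmax limits center each code on one color
ax.imshow(coded.to_numpy(), cmap=ListedColormap(colors),
          vmin=-0.5, vmax=len(categories) - 0.5, aspect="auto")
ax.set_xticks(range(coded.shape[1]))
ax.set_xticklabels(coded.columns)
ax.set_yticks(range(coded.shape[0]))
ax.set_yticklabels(coded.index)
fig.savefig("category_heatmap.png")
```

Unlike the loop above, this draws no grid lines between cells; those would have to be added separately.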

I want to draw a pie chart of maximums

I have a dataframe like this:
dog_stage favorite_count
8 doggo 32467
11 puppo 38818
13 puppo 15359
25 pupper 21524
34 doggo 20771
I want to draw a pie chart of the most favorite dog_stages
Group them together and draw a pie chart with the pandas plotting function.
import pandas as pd
import numpy as np
import io
data = '''
dog_stage favorite_count
8 doggo 32467
11 puppo 38818
13 puppo 15359
25 pupper 21524
34 doggo 20771
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
df.groupby('dog_stage')['favorite_count'].sum().plot(kind='pie', title='favorite count', ylabel='', autopct="%1.1f%%")
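Note that the answer sums favorite_count per dog_stage, while the question title asks about maximums. If the largest single count per stage is what is wanted, a sketch along these lines may be closer; the variable names are illustrative, and the same .plot(kind='pie', ...) call can be chained onto either result.

```python
import pandas as pd

# Sample data copied from the question, including its original row labels
df = pd.DataFrame(
    {
        "dog_stage": ["doggo", "puppo", "puppo", "pupper", "doggo"],
        "favorite_count": [32467, 38818, 15359, 21524, 20771],
    },
    index=[8, 11, 13, 25, 34],
)

grouped = df.groupby("dog_stage")["favorite_count"]
total_per_stage = grouped.sum()  # what the answer above plots
max_per_stage = grouped.max()    # largest single count per stage

print(max_per_stage)
```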

TypeError: '<=' not supported between instances of 'Timestamp' and 'numpy.float64'

I am trying to plot using hvplot, and I am getting this:
TypeError: '<=' not supported between instances of 'Timestamp' and 'numpy.float64'
Here is my data:
TimeConv Hospitalizations
1 2020-04-04 827
2 2020-04-05 1132
3 2020-04-06 1153
4 2020-04-07 1252
5 2020-04-08 1491
... ... ...
71 2020-06-13 2242
72 2020-06-14 2287
73 2020-06-15 2326
74 NaT NaN
75 NaT NaN
Below is my code:
import numpy as np
import matplotlib.pyplot as plt
import xlsxwriter
import pandas as pd
from pandas import DataFrame
path = ('Casecountdata.xlsx')
xl = pd.ExcelFile(path)
df1 = xl.parse('Hospitalization by Day')
df2 = df1[['Unnamed: 1','Unnamed: 2']]
df2 = df2.drop(df2.index[0])
df2 = df2.rename(columns={"Unnamed: 1": "Time", "Unnamed: 2": "Hospitalizations"})
df2['TimeConv'] = pd.to_datetime(df2.Time)
df3 = df2[['TimeConv','Hospitalizations']]
When I take a sample of your data above and plot it, it works for me, so something may be going wrong when you read the data from Excel into pandas. Try df.info() to check the datatypes of your data: column TimeConv should be datetime64[ns] and column Hospitalizations should be int64 (or float). It could also be a version problem, so check that you have the latest versions of hvplot etc. installed. My guess, though, is that your data doesn't look right.
In any case, when I run the following, it works and plots your data:
# import libraries
import pandas as pd
import hvplot.pandas
import holoviews as hv
hv.extension('bokeh')
from io import StringIO # need this to read your text data
# your sample data
text_data = StringIO("""
column1 TimeConv Hospitalizations
1 2020-04-04 827
2 2020-04-05 1132
72 2020-06-14 2287
73 2020-06-15 2326
74 NaT NaN
""")
# read text data to dataframe
df = pd.read_csv(text_data, sep=r"\s+")
df['TimeConv'] = pd.to_datetime(df.TimeConv, yearfirst=True)
# shortly checkout datatypes of your data
df.info()
# create scatter plot of your data
df.hvplot.scatter(
    x='TimeConv',
    y='Hospitalizations',
    width=500,
    title='Showing hospitalizations over time',
)
This code produces a scatter plot of Hospitalizations against TimeConv.
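If df.info() reports object dtype for either column after the Excel import, a cleanup along the following lines might help. This is a sketch under assumptions: the raw frame below only imitates what the Excel sheet might yield (strings plus empty trailing rows), and the column names follow the question.

```python
import pandas as pd

# Hypothetical stand-in for the parsed Excel sheet: object columns with empty rows
raw = pd.DataFrame(
    {
        "Time": ["2020-04-04", "2020-04-05", "2020-06-15", None, None],
        "Hospitalizations": ["827", "1132", "2326", None, None],
    },
    dtype=object,
)

clean = pd.DataFrame(
    {
        # Unparseable or missing entries become NaT / NaN instead of raising
        "TimeConv": pd.to_datetime(raw["Time"], errors="coerce"),
        "Hospitalizations": pd.to_numeric(raw["Hospitalizations"], errors="coerce"),
    }
).dropna()  # drop the empty trailing rows

clean.info()
```

After this step both columns have plottable dtypes and the NaT/NaN rows are gone, which should rule out mixed types as the source of the comparison error.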

alternatives to pivot very large table pandas

I have a dataframe of 25M x 3 cols of format:
import pandas as pd
import numpy as np
d={'ID':['A1','A1','A2','A2','A2'], 'date':['Jan 1','Jan7','Jan4','Jan5','Jan12'],'value':[10,12,3,5,2]}
df=pd.DataFrame(data=d)
df
ID date value
0 A1 Jan 1 10
1 A1 Jan7 12
2 A2 Jan4 3
3 A2 Jan5 5
4 A2 Jan12 2
...
And want to pivot it using:
df['date'] = pd.to_datetime(df['date'], format='%b%d')
(df.pivot(index='date', columns='ID',values='value')
   .asfreq('D')
   .interpolate()
   .bfill()
   .reset_index()
)
df.index = df.index.strftime('%b%d')
This works for 500k rows
df3=(df.iloc[:500000,:].pivot(index='date', columns='ID',values='value')
   .resample('M').mean()
   .interpolate()
   .bfill()
   .reset_index()
)
, but when I used my full data set, or >1M rows, it fails with:
ValueError: Unstacked DataFrame is too big, causing int32 overflow
Are there any suggestions on how I can get this to run to completion?
A further computation is performed on the wide table:
N=19/df2.iloc[0]
df2.mul(N.tolist(),axis=1).sum(1)
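One workaround that is sometimes suggested for this error, sketched here under assumptions, is to pivot subsets of the IDs separately and concatenate the pieces column-wise, so that no single pivot call has to build the whole wide table at once. The helper pivot_in_chunks and its chunk_size parameter are hypothetical names, and the sample uses full dates rather than the question's 'Jan 1' strings. The final wide table still has to fit in memory.

```python
import numpy as np
import pandas as pd

# Small stand-in for the 25M-row frame, with full dates assumed
df = pd.DataFrame({
    "ID": ["A1", "A1", "A2", "A2", "A2"],
    "date": pd.to_datetime(
        ["2020-01-01", "2020-01-07", "2020-01-04", "2020-01-05", "2020-01-12"]
    ),
    "value": [10, 12, 3, 5, 2],
})


def pivot_in_chunks(frame, chunk_size):
    """Pivot chunk_size IDs at a time and join the pieces on the date index."""
    ids = frame["ID"].unique()
    pieces = []
    for start in range(0, len(ids), chunk_size):
        subset = frame[frame["ID"].isin(ids[start:start + chunk_size])]
        pieces.append(subset.pivot(index="date", columns="ID", values="value"))
    # Column-wise concat aligns the pieces on the union of their dates
    return pd.concat(pieces, axis=1).sort_index()


wide = pivot_in_chunks(df, chunk_size=1)
print(wide)
```

The later steps from the question (asfreq or resample, interpolate, bfill) can then be applied to wide, or to each piece before concatenating if the per-column operations are independent.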

Plotting 2 columns as 2 lines and 1 column as x axis on Dataframes

I'm new to pandas and dataframes in general. I would like to know how I could convert my current code to use plt.figure instead. I want to plot 2 columns (Tourism Receipts, Visitors) as lines, with another column (Quarters) as the x axis.
The code below seems to work, but I would like to know whether there is a better way to do it, such as plt.plot, that still lets me set Quarters as the x axis and the other 2 columns as lines.
df1= df.set_index('Quarters').plot(figsize=(10,5), grid=True)
Dataframe (from my csv file):
| Quarters | Tourism Receipts | Visitors |
| 2019 Q1 | 10 | 1 |
| 2019 Q2 | 20 | 2 |
| 2019 Q3 | 30 | 3 |
| 2019 Q4 | 40 | 4 |
I understand the following method:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(20,10))
plt.plot(x,y)
plt.title
plt.xlabel
plt.ylabel
Is there a way to convert the 'df.set_index' method to plt instead?
You can actually combine both: use the pandas .plot method, which saves a lot of effort, and use matplotlib features alongside it to customize the output.
Here is sample code showing how to do this:
from matplotlib import pyplot as plt
import pandas as pd
fig, ax = plt.subplots(1, figsize=(10, 5))
# Let pandas draw both lines on the matplotlib Axes created above
df.set_index('Quarters')[['Tourism Receipts', 'Visitors']].plot(grid=True, ax=ax)
ax.set_yticks(range(-10, 41, 5))
# ax.set_yticklabels( ('{}%'.format(x) for x in range(0, 101, 10)), fontsize=15)
# With string labels in the index, pandas plots the points at positions 0, 1, 2, ...
ax.set_xticks(range(len(df)))
ax.set_xticklabels(df['Quarters'])
ax.legend(loc='lower left')
You can do the same for yticks as well.
PS: The Quarters column shown in the question already includes the year (e.g. '2019 Q1'), so its values are used directly as the tick labels.
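For the pure-matplotlib route the question asks about, a minimal sketch could look like the following. It assumes the dataframe shown in the question, rebuilt here by hand; the title text and the Agg backend are choices made for this example only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Sample data copied from the question's table
df = pd.DataFrame({
    "Quarters": ["2019 Q1", "2019 Q2", "2019 Q3", "2019 Q4"],
    "Tourism Receipts": [10, 20, 30, 40],
    "Visitors": [1, 2, 3, 4],
})

fig, ax = plt.subplots(figsize=(10, 5))
# One ax.plot call per column; string x values are treated as categories
for column in ["Tourism Receipts", "Visitors"]:
    ax.plot(df["Quarters"], df[column], label=column)
ax.set_xlabel("Quarters")
ax.set_title("Tourism receipts and visitors by quarter")
ax.grid(True)
ax.legend(loc="upper left")
fig.savefig("quarters.png")
```

Here no index manipulation is needed, because the Quarters column is passed directly as the x values.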