Building a histogram - pandas

How can a distribution histogram similar to this one be constructed based on the data from the table?
Code python:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel('Data.xlsx')
print(df)
df.plot.hist(df)
plt.show()

It isn't clear exactly what the x and y axes of your desired plot are. Hopefully this will get you started. Sometimes trying to come up with an MRE (minimal reproducible example) will help you solve your own problem.
import random
import pandas as pd
import matplotlib.pyplot as plt
#######################################
# generate some random data for a MWE #
#######################################
random.seed(22)
data = [random.randint(0, 100) for _ in range(0, 10)]
data = pd.Series(sorted(data))
freqs = [random.uniform(0, 1) for _ in range(0, 10)]
freqs = sorted(freqs)
freqs = pd.Series(freqs)
df = pd.DataFrame()
df['data'] = data
df['frequencies'] = freqs
###############################################
# Desired bar plot using pandas built in plot #
###############################################
df.plot(x='data', y='frequencies', kind='bar')
plt.show()
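If the table holds raw observations rather than precomputed frequencies, the values have to be binned before plotting. The following is a minimal sketch under assumptions that are not in the question: a hypothetical column named `value`, synthetic data, and five equal-width bins.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; remove to view interactively
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical raw data standing in for the spreadsheet contents
rng = np.random.default_rng(22)
df = pd.DataFrame({"value": rng.integers(0, 100, size=200)})

# Bin the raw values: counts[i] is the number of values in bin i,
# edges holds the 6 boundaries of the 5 bins
counts, edges = np.histogram(df["value"], bins=5, range=(0, 100))
rel_freq = counts / counts.sum()  # relative frequency per bin

# Draw one bar per bin, left-aligned on the bin's lower edge
fig, ax = plt.subplots()
ax.bar(edges[:-1], rel_freq, width=np.diff(edges), align="edge", edgecolor="black")
ax.set_xlabel("value")
ax.set_ylabel("relative frequency")
```

To view the figure interactively, drop the `matplotlib.use("Agg")` line and call `plt.show()` at the end.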

Related

Barplot per each ax in matplotlib

I have the following dataset, ratings in stars for two fictitious places:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'id':['A','A','A','A','A','A','A','B','B','B','B','B','B'],
'rating':[1,2,4,5,5,5,3,1,3,3,3,5,2]})
Since the rating is categorical (not continuous) data, I convert it to a category:
df['rating_cat'] = pd.Categorical(df['rating'])
What I want is a bar plot for each fictitious place ('A' or 'B') showing the count of each rating. This is the intended plot:
I guess a for loop over each value in id could work, but I have some trouble deciding the size:
fig, ax = plt.subplots(1,2,figsize=(6,6))
axs = ax.flatten()
cats = df['rating_cat'].cat.categories.tolist()
ids_uniques = df.id.unique()
for i in range(len(ids_uniques)):
    ax[i].bar(df[df['id']==ids_uniques[i]], df['rating'].size())
But it returns an error: TypeError: 'int' object is not callable
Perhaps what I am doing is overly complicated. Could you please guide me with this code?
The pure matplotlib way:
from math import ceil
# Prepare the data for plotting
df_plot = df.groupby(["id", "rating"]).size()
unique_ids = df_plot.index.get_level_values("id").unique()
# Calculate the grid spec. This will be an n x 2 grid
# to fit one chart per id
ncols = 2
nrows = ceil(len(unique_ids) / ncols)
fig = plt.figure(figsize=(6,6))
for i, id_ in enumerate(unique_ids):
    # In a figure grid spanning nrows x ncols, plot into the
    # axes at position i + 1
    ax = fig.add_subplot(nrows, ncols, i+1)
    # pandas takes the target axes through the `ax` keyword
    df_plot.xs(id_).plot(ax=ax, kind="bar")
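For reference, the `TypeError` in the question comes from `df['rating'].size()`: `Series.size` is an integer attribute, not a method, so calling it fails. Below is a sketch of the question's own loop with that fixed, counting ratings per id with `value_counts`. The figure size and the `sharey=True` choice are assumptions, not part of the original post.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; remove to view interactively
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'id': ['A','A','A','A','A','A','A','B','B','B','B','B','B'],
                   'rating': [1,2,4,5,5,5,3,1,3,3,3,5,2]})

ratings = sorted(df["rating"].unique())  # every rating that occurs anywhere
ids = df["id"].unique()

# One subplot per id, sharing the y axis so the counts are comparable
fig, axes = plt.subplots(1, len(ids), figsize=(8, 4), sharey=True)
for ax, id_ in zip(axes, ids):
    # Count each rating for this id; ratings that never occur get 0
    counts = (df.loc[df["id"] == id_, "rating"]
                .value_counts()
                .reindex(ratings, fill_value=0))
    # String labels make matplotlib treat the x axis as categorical
    ax.bar([str(r) for r in ratings], counts.to_numpy())
    ax.set_title(f"id = {id_}")
    ax.set_xlabel("rating")
axes[0].set_ylabel("count")
```

Reindexing on the full list of ratings keeps the same five bars in every panel, even when a place has no reviews with a given rating.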
You can simplify things a lot with Seaborn:
import seaborn as sns
sns.catplot(data=df, x="rating", col="id", col_wrap=2, kind="count")
If you're ok with installing a new library, seaborn has a very helpful countplot. Seaborn uses matplotlib under the hood and makes certain plots easier.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({'id':['A','A','A','A','A','A','A','B','B','B','B','B','B'],
'rating':[1,2,4,5,5,5,3,1,3,3,3,5,2]})
sns.countplot(
    data=df,
    x='rating',
    hue='id',
)
plt.show()
plt.close()
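If installing seaborn is not an option, a similar result can probably be had with pandas alone. This is a sketch, assuming one panel per id is wanted: `pd.crosstab` builds the rating-by-id count table and `DataFrame.plot` with `subplots=True` draws one bar chart per column. The layout and figure size are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; remove to view interactively
import pandas as pd

df = pd.DataFrame({'id': ['A','A','A','A','A','A','A','B','B','B','B','B','B'],
                   'rating': [1,2,4,5,5,5,3,1,3,3,3,5,2]})

# Rows: rating, columns: id, cells: number of occurrences
# (0 where an id has no entry for that rating)
counts = pd.crosstab(df["rating"], df["id"])

# One subplot per column of the count table, side by side
axes = counts.plot(kind="bar", subplots=True, layout=(1, 2),
                   sharey=True, figsize=(8, 4), legend=False)
```

Dropping `subplots=True` and `layout` would instead give grouped bars in a single axes, comparable to the `hue='id'` plot.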

Creating 3D scatter chart in Taipy

I was wondering how one would create a 3D scatter chart in Taipy.
I tried this code initially:
import pandas as pd
import numpy as np
from taipy import Gui
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=list('xyz'))
df['cluster1']=np.random.randint(0,3,100)
my_page ="""
Creation of a 3-D chart:
<|{df}|chart|type=Scatter3D|x=x|y=y|z=z|mode=markers|color=cluster|>
"""
Gui(page=my_page).run()
This does indeed display a 3D plot, but the colors (clusters) do not show up.
Any hint?
Yes, your dataframe needs some massaging to do it.
Here's sample code that achieves this:
import pandas as pd
import numpy as np
from taipy import Gui
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=list('xyz'))
df['cluster1']=np.random.randint(0,3,100)
# Create a list of 3 dataframes, one per cluster
datas = [df[df['cluster1']==i] for i in range(3)]
properties = {}
# Create the property list dynamically.
# str(i) points to a dataframe index
# "/x" points to the column value in the selected dataframe
for i in range(len(datas)):
    properties[f"x[{i+1}]"] = str(i)+"/x"
    properties[f"y[{i+1}]"] = str(i)+"/y"
    properties[f"z[{i+1}]"] = str(i)+"/z"
    properties[f'name[{i+1}]'] = str(i+1)
print(properties)
chart = "<|{datas}|chart|type=Scatter3D|properties={properties}|mode=markers|height=800px|>"
Gui(page=chart).run()
In fact, with the new release, Taipy 1.1, this is very easy to do in a few lines of code:
import pandas as pd
import numpy as np
from taipy import Gui
color_map={0:"blue",1:'green', 2:"red"}
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=list('xyz'))
df['cluster1'] = np.random.randint(0,3,100)
df['cluster_colors'] = df.apply(lambda row: color_map[row.cluster1], axis=1)
marker = {"color":"cluster_colors"}
chart = "<|{df}|chart|type=Scatter3D|x=x|y=y|z=z|marker={marker}|mode=markers|height=800px|>"
Gui(page=chart).run()
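As a side note on the data preparation only, the color column can likely be built with `Series.map` instead of a row-wise `apply`, which tends to be shorter and faster on large frames. The sketch below covers just the pandas part; the seeded generator is an assumption added for reproducibility, and the Taipy chart definition stays as shown above.

```python
import numpy as np
import pandas as pd

color_map = {0: "blue", 1: "green", 2: "red"}

# Same shape of data as above, with a seeded generator for reproducibility
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(100, 3)), columns=list("xyz"))
df["cluster1"] = rng.integers(0, 3, size=100)

# Look up each cluster id in color_map, element by element
df["cluster_colors"] = df["cluster1"].map(color_map)
```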
If you want to leave it to Taipy to pick the colors for you, then you can simply use:
import pandas as pd
import numpy as np
from taipy import Gui
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=list('xyz'))
df['cluster1'] = np.random.randint(0,3,100)
marker = {"color":"cluster1"}
chart = "<|{df}|chart|type=Scatter3D|x=x|y=y|z=z|marker={marker}|mode=markers|height=800px|>"
Gui(page=chart).run()

incorporating p-value into box or violin plot

I have made a violin plot with the code below (see pic for plot). I'm wondering if it's possible to get p-values of the differences between samples on the x axis. This could be any statistical test that shows a p-value, so if there is a global shift in the violin plot, the difference could be seen.
violin plot
Edit:
For clarity, was hoping to add something like this to show pvals between samples:
for i,p in enumerate(pvals):
    if p>=0.05:
        displaystring = r'n.s.'
    elif p<0.0001:
        displaystring = r'***'
    elif p<0.001:
        displaystring = r'**'
    else:
        displaystring = r'*'
Python code for making violin plot:
#!/usr/bin/env python
"""
Usage: Run script in ~/snakemake_eclip/scripts, use help function to see which parameters are needed.
This script takes in the all_reads_matrix made by merge_matrix.py and creates a violin plot.
"""
import pandas as pd
import argparse
import matplotlib.pyplot as plt
import os
import seaborn as sns
import numpy as np
from scipy import stats
plt.switch_backend('agg')
def make_violin(in_matrix, save_path):
    df = pd.read_csv(str(in_matrix), index_col=False)
    # remove outliers
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
    # drop zeros
    df = df[(df != 0).all(1)]
    df = df.iloc[:, 1:].transform(lambda x: np.log(x / x.sum()))
    print(df)
    plt.figure(figsize=(20, 10), dpi=300)
    sns.violinplot(data=df)
    plt.plot()
    plt.title("Read Counts of Individual ENSG")
    plt.xlabel("Samples")
    plt.ylabel("Log Transformed Normalized Read Count")
    plt.savefig(os.path.join(str(save_path), 'all_reads_matrix_violin_plot_norm_log.pdf'))
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Create a violin plot from all_reads_matrix.csv')
    parser.add_argument("--in_matrix",
                        help='name of input matrix')
    parser.add_argument("--save_path",
                        help='path to save')
    # parse out arguments
    args = parser.parse_args()
    # mutate matrix columns
    make_violin(args.in_matrix, args.save_path)
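One possible way to get the `pvals` used in the snippet from the edit is a pairwise test between sample columns. The sketch below is only an illustration under stated assumptions: the question does not name a test, so the two-sided Mann-Whitney U test (`scipy.stats.mannwhitneyu`) is an arbitrary choice; the data and the column names `sample_a`, `sample_b`, `sample_c` are synthetic; and no multiple-testing correction is applied. The star thresholds are the ones from the question.

```python
from itertools import combinations

import matplotlib
matplotlib.use("Agg")  # headless backend, as in the script above
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats


def stars(p):
    """Map a p-value to a label, using the thresholds from the question."""
    if p >= 0.05:
        return "n.s."
    if p < 0.0001:
        return "***"
    if p < 0.001:
        return "**"
    return "*"


# Synthetic stand-in for the log-normalized matrix (hypothetical column names)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sample_a": rng.normal(0.0, 1.0, 200),
    "sample_b": rng.normal(1.0, 1.0, 200),  # shifted on purpose
    "sample_c": rng.normal(0.0, 1.0, 200),
})

# One two-sided Mann-Whitney U test per pair of columns (assumed test)
pvals = {
    (a, b): stats.mannwhitneyu(df[a], df[b], alternative="two-sided").pvalue
    for a, b in combinations(df.columns, 2)
}

# Draw the violins at integer positions 0, 1, 2
fig, ax = plt.subplots(figsize=(8, 5))
positions = np.arange(len(df.columns))
ax.violinplot([df[c].to_numpy() for c in df.columns], positions=positions)
ax.set_xticks(positions)
ax.set_xticklabels(df.columns)

# Stack one bracket per pair above the data and label it with the stars
top = df.to_numpy().max()
step = 0.6
for level, ((a, b), p) in enumerate(pvals.items(), start=1):
    xa, xb = df.columns.get_loc(a), df.columns.get_loc(b)
    y = top + level * step
    ax.plot([xa, xa, xb, xb], [y - 0.1, y, y, y - 0.1], color="black", linewidth=1)
    ax.text((xa + xb) / 2, y, stars(p), ha="center", va="bottom")
```

The same bracket loop should work on top of `sns.violinplot(data=df)`, since seaborn also places the violins at integer x positions. With many samples the number of pairs grows quickly, so comparing every sample against one reference may be more readable.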

How to set x axis according to the numbers in the DATAFRAME

I am using Matplotlib to graph some information that I get from the users.
I want the x axis to show the users' IDs and the y axis to show the number of wins they have.
I don't understand how I can use the IDs of my users as the x axis ticks.
My code:
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import pandas as pd
import numpy as np
#df = pd.read_csv('Players.csv')
df = pd.read_json('Players.json')
# df.groupby('ID').sum()['Win']
axisx = df.groupby('ID').sum()['Win'].keys()
axisy = df.groupby('ID').sum()['Win'].values
fig = pylab.gcf()
# fig.canvas.set_window_title('4 In A Row Statistic')
# img = plt.imread("Oi.jpeg")
# plt.imshow(img)
fig, ax = plt.subplots()
ax.set_xticklabels(axisx.to_list())
plt.title('Game Statistic',fontsize=20,color='r')
plt.xlabel('ID Players',color='r')
plt.ylabel('Wins',color='r')
x = np.arange(len(axisx))
rects = ax.bar(x, axisy, width=0.1)
plt.show()
Use plt.xticks. It gets or sets the current tick locations and labels of the x-axis, so plt.xticks(x, axisx) places the IDs as labels at the bar positions x.
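A minimal sketch of that idea using the Axes methods instead of `plt.xticks`. The small data frame is a hypothetical stand-in for `Players.json`, whose contents are not shown in the question; only the `ID` and `Win` column names are taken from it.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; remove to view interactively
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for Players.json
df = pd.DataFrame({"ID": [101, 205, 101, 307, 205, 101],
                   "Win": [1, 0, 1, 1, 1, 0]})

# Total wins per player; the index of the result holds the IDs
wins = df.groupby("ID")["Win"].sum()

x = np.arange(len(wins))  # bar positions 0, 1, 2, ...
fig, ax = plt.subplots()
ax.bar(x, wins.to_numpy(), width=0.5)

# Put one tick under each bar, then label the ticks with the IDs
ax.set_xticks(x)
ax.set_xticklabels(wins.index.astype(str).tolist())
ax.set_xlabel("ID Players")
ax.set_ylabel("Wins")
```

Setting the tick positions before the labels matters: in the question, `set_xticklabels` is called on an empty axes, so the labels are attached to the default ticks rather than to the bars.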

Figures names in Pandas Boxplots

I created 2 boxplots using pandas.
Then each figure gets referenced with plt.gcf()
When trying to show the plots, only the last boxplot gets shown. It's as if fig1 is getting overwritten.
What is the correct way of showing both boxplots?
This is the sample code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
dates = pd.date_range('20000101', periods=10)
df = pd.DataFrame(index=dates)
df['A'] = np.cumsum(np.random.randn(10))
df['B'] = np.random.randint(-1,2,size=10)
df['C'] = range(1,11)
df['D'] = range(12,22)
# first figure
ax_boxplt1 = df[['A','B']].boxplot()
fig1 = plt.gcf()
# second figure
ax_boxplt2 = df[['C','D']].boxplot()
fig2 = plt.gcf()
# print figures
figures = [fig1,fig2]
for fig in figures:
    print(fig)
Create a figure with two axes and plot to each of them separately:
fig, axes = plt.subplots(2)
dates = pd.date_range('20000101', periods=10)
df = pd.DataFrame(index=dates)
df['A'] = np.cumsum(np.random.randn(10))
df['B'] = np.random.randint(-1,2,size=10)
df['C'] = range(1,11)
df['D'] = range(12,22)
# first figure
df[['A','B']].boxplot(ax=axes[0]) # Added `ax` parameter
# second figure
df[['C','D']].boxplot(ax=axes[1]) # Added `ax` parameter
plt.show()
In order to get two figures, define the figure before plotting to it. You can use a number to enumerate the figures.
plt.figure(1)
# do something with the first figure
plt.figure(2)
# do something with the second figure
Complete example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
dates = pd.date_range('20000101', periods=10)
df = pd.DataFrame(index=dates)
df['A'] = np.cumsum(np.random.randn(10))
df['B'] = np.random.randint(-1,2,size=10)
df['C'] = range(1,11)
df['D'] = range(12,22)
# first figure
fig1=plt.figure(1)
ax_boxplt1 = df[['A','B']].boxplot()
# second figure
fig2=plt.figure(2)
ax_boxplt2 = df[['C','D']].boxplot()
plt.show()
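The same result can be reached without relying on the "current figure" state, which is what trips up the question: `DataFrame.boxplot` without an `ax` argument draws on the current axes, so both `plt.gcf()` calls there return the same figure. A sketch that creates each figure explicitly and passes its axes to pandas (the variable names are only illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; remove to view interactively
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

dates = pd.date_range('20000101', periods=10)
df = pd.DataFrame(index=dates)
df['A'] = np.cumsum(np.random.randn(10))
df['B'] = np.random.randint(-1, 2, size=10)
df['C'] = range(1, 11)
df['D'] = range(12, 22)

# first figure: create it explicitly and hand its axes to pandas
fig1, ax1 = plt.subplots()
df[['A', 'B']].boxplot(ax=ax1)

# second figure: a separate Figure object with its own axes
fig2, ax2 = plt.subplots()
df[['C', 'D']].boxplot(ax=ax2)

figures = [fig1, fig2]
```

Each entry of `figures` can then be saved or shown on its own, for example with `fig1.savefig(...)`, or all at once with `plt.show()` under an interactive backend.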