Distribution probabilities for each column data frame, in one plot


I am creating probability distributions for each column of my data frame with seaborn's sns.distplot(). For a single plot I do:
x = df['A']
sns.distplot(x);
I am trying to use FacetGrid and map to draw the plots for all columns at once, like this, but it doesn't work at all:
g = sns.FacetGrid(df, col = 'A','B','C','D','E')
g.map(sns.distplot())

I think you need to use melt to reshape your dataframe to long format; see this MCVE:
import numpy as np
import pandas as pd
import seaborn as sns
df = pd.DataFrame(np.random.random((100,5)), columns = list('ABCDE'))
# one row per (column name, value) pair
dfm = df.melt(var_name='columns')
g = sns.FacetGrid(dfm, col='columns')
g = g.map(sns.distplot, 'value')
Update: sns.distplot is deprecated as of seaborn 0.11, and the seaborn documentation recommends the figure-level functions over using FacetGrid directly. For this kind of faceted plot, use sns.displot instead:
import numpy as np
import pandas as pd
import seaborn as sns
np.random.seed(2022)
df = pd.DataFrame(np.random.random((100,5)), columns = list('ABCDE'))
dfm = df.melt(var_name='columns')
g = sns.displot(data=dfm, x='value', col='columns', col_wrap=3, common_norm=False, kde=True, stat='density')
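To see what melt actually hands to seaborn, the following minimal sketch prints the shape of the long-format frame. It uses only pandas and NumPy; the random generator and its seed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Assumed example data: 100 rows, 5 columns named A to E.
rng = np.random.default_rng(2022)
df = pd.DataFrame(rng.random((100, 5)), columns=list("ABCDE"))

# Wide format: one column per variable, 100 rows.
# Long format: one row per (variable, value) pair, so 5 * 100 rows.
dfm = df.melt(var_name="columns")

print(dfm.shape)
print(dfm["columns"].value_counts().sort_index())
```

The single 'columns' column is what the col argument of FacetGrid or displot needs: it tells seaborn which facet each value belongs to.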

You're getting this wrong on two levels.
Python syntax.
FacetGrid(df, col = 'A','B','C','D','E') is invalid: col is set to 'A', and the remaining strings are treated as further positional arguments. Positional arguments may not follow a keyword argument, so Python raises a SyntaxError.
Seaborn concepts.
Seaborn expects a single column name as input for the col or row argument. This means that the dataframe needs to be in a format that has one column which determines to which column or row the respective datum belongs.
You do not call the function to be used by map: write g.map(sns.distplot, ...), not g.map(sns.distplot()). The idea is of course that map itself calls it.
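Both points can be checked without seaborn. The sketch below compiles the call from the question as a string to surface the syntax error, then uses a hypothetical helper, apply_to_all, as a stand-in for how a map-style method receives a function object and calls it itself. The helper is not part of seaborn.

```python
# The call from the question, kept as a string so that compiling it
# does not stop this script.
bad_call = "FacetGrid(df, col='A', 'B', 'C', 'D', 'E')"

try:
    compile(bad_call, "<example>", "eval")
    syntax_error = None
except SyntaxError as exc:
    syntax_error = exc.msg

print(syntax_error)


def apply_to_all(func, values):
    """Hypothetical stand-in for a map-style method: it calls func itself."""
    return [func(v) for v in values]


# Pass the function object (abs), not the result of calling it (abs()).
results = apply_to_all(abs, [-1, 2, -3])
print(results)
```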
Solutions:
Loop over columns:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(np.random.randn(14,5), columns=list("ABCDE"))
fig, axes = plt.subplots(ncols=5)
for ax, col in zip(axes, df.columns):
    sns.distplot(df[col], ax=ax)
plt.show()
Melt dataframe
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(np.random.randn(14,5), columns=list("ABCDE"))
g = sns.FacetGrid(df.melt(), col="variable")
g.map(sns.distplot, "value")
plt.show()

You can use the following:
# listing dataframes types
list(set(df.dtypes.tolist()))
# include only float and integer
df_num = df.select_dtypes(include = ['float64', 'int64'])
# display what has been selected
df_num.head()
# plot
df_num.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8);
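One caveat with listing 'float64' and 'int64' explicitly is that columns stored as other numeric dtypes are left out. The sketch below, on a small made-up frame, compares that selection with include="number", which pandas treats as all numeric dtypes.

```python
import numpy as np
import pandas as pd

# Made-up frame with mixed dtypes, for illustration only.
df = pd.DataFrame({
    "small_int": np.array([1, 2, 3], dtype="int32"),
    "big_int": np.array([10, 20, 30], dtype="int64"),
    "ratio": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "label": ["a", "b", "c"],
})

# Only the exact dtypes that are listed are kept.
explicit = df.select_dtypes(include=["float64", "int64"])
# "number" keeps every numeric column regardless of width.
generic = df.select_dtypes(include="number")

print(list(explicit.columns))
print(list(generic.columns))
```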

I think the easiest approach is to just loop the columns and create a plot.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.random((100,5)), columns = list('ABCDE'))
for col in df.columns:
    hist = df[col].hist(bins=10)
    print("Plotting for column {}".format(col))
    plt.show()

Related

List comprehension while plotting graph from several columns

I am trying to plot a line graph from several columns
ax = sns.lineplot(data=mt,
x= ['pt'],
y = [c for c in mt.columns if c not in ['pt']],
dashes=False)
The response I am getting is
ValueError: Length of list vectors must match length of `data` when both are used, but `data` has length 13 and the vector passed to `x` has length 1.

Seaborn prefers data in long form, which can be created via pd.melt(). A wide-form dataframe is also supported if you set the x values as its index (and the data isn't too complex).
Here is a simple example:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
mt = pd.DataFrame({'pt': np.arange(100),
'y1': np.random.randn(100).cumsum(),
'y2': np.random.randn(100).cumsum(),
'y3': np.random.randn(100).cumsum()})
sns.set()
ax = sns.lineplot(data=mt.set_index('pt'), dashes=True)
plt.tight_layout()
plt.show()

Plotting dataframe with NAs with linearly joined points

I have a dataframe where each column has many missing values. How can I make a plot where the datapoints in each column are joined with lines, i.e. NAs are ignored, instead of having a choppy plot?
import numpy as np
import pandas as pd
pd.options.plotting.backend = "plotly"
d = pd.DataFrame(data = np.random.choice([np.nan] + list(range(7)), size=(10,3)))
d.plot(markers=True)
One way is to use this for each column:
fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y, name="linear",
line_shape='linear'))
Are there any better ways to accomplish this?

You can use pandas interpolate. I have demonstrated it using plotly express, with chained calls so that the underlying data is not changed.
Following the comments on this post, I have amended the answer so that markers are not shown for interpolated points.
import numpy as np
import pandas as pd
import plotly.express as px
d = pd.DataFrame(data=np.random.choice([np.nan] + list(range(7)), size=(10, 3)))
px.line(d).update_traces(mode="lines+markers").add_traces(
px.line(d.interpolate(limit_direction="both")).update_traces(showlegend=False).data
)
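What interpolate does to a column can be seen on a tiny series. This is a pandas-only sketch with made-up values; it shows that the gaps are filled in a copy while the original keeps its NaNs.

```python
import numpy as np
import pandas as pd

# Made-up column with a leading, an interior and a trailing gap.
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# Linear interpolation; limit_direction="both" also fills the gaps
# at the start and at the end of the series.
filled = s.interpolate(limit_direction="both")

print(s.tolist())
print(filled.tolist())
```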

how to assign different markers to the max value found in each column in the plot

How would I assign markers of different symbols to each of the max values found in each curve, i.e. 4 different markers showing the max value in each curve?
Here is my attempt
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
df = pd.DataFrame(np.random.randint(0,1000,size=(100, 4)), columns=list('ABCD'))
maxValues=df.max()
m=['o', '.', ',', 'x',]
df.plot()
plt.plot(maxValues, marker=m)
In my real df, the number of columns will vary.

You can do it this way. Note that I used 'v' instead of ',' as the comma (pixel) marker wasn't showing up clearly.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
df = pd.DataFrame(np.random.randint(0,1000,size=(100, 4)), columns=list('ABCD'))
df.plot(figsize=(20,5))
# row 0: index of the maximum, row 1: maximum value, row 2: marker symbol
mrk = pd.DataFrame({'A': [df['A'].idxmax(), df['A'].max(), 'o'],
                    'B': [df['B'].idxmax(), df['B'].max(), '.'],
                    'C': [df['C'].idxmax(), df['C'].max(), 'v'],
                    'D': [df['D'].idxmax(), df['D'].max(), 'x']})
for col in range(len(mrk.columns)):
    plt.plot(mrk.iloc[0, col], mrk.iloc[1, col], marker=mrk.iloc[2, col], markersize=20)
I created the mrk dataframe manually as it was small, but you can use loops to go through the various columns in your real data. Adjust markersize to increase or decrease the size of the markers.
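For a frame whose number of columns varies, the loop could look like the following sketch. The marker list and the use of itertools.cycle to reuse symbols are assumptions for illustration, and the Agg backend is selected only so that it runs without a display.

```python
import itertools

import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 1000, size=(100, 4)), columns=list("ABCD"))

ax = df.plot(figsize=(20, 5))

# Cycle through a fixed set of symbols so any number of columns works.
markers = itertools.cycle(["o", "s", "v", "x", "D", "^"])
max_points = {}
for col, marker in zip(df.columns, markers):
    x_max = df[col].idxmax()  # index label of the first maximum
    y_max = df[col].max()     # the maximum value itself
    ax.plot(x_max, y_max, marker=marker, markersize=20, linestyle="none")
    max_points[col] = (x_max, y_max)
```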

How to plot a grid of histograms with Matplotlib in the order of the DataFrame columns?

Consider the simple data frame below:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'var3':[1,3,9,6,1,6,3,1,1,3],
'var1':[9,1,2,6,6,5,9,3,1,7],
'var2':[6,6,2,9,8,3,5,4,1,3]})
df
Now, let's plot a set of histograms from this data:
df.hist(layout=(1,3))
plt.show()
Note that the order (from left to right) of the histograms in the figure is different from the order of the columns in the data frame. How can I make the histograms follow the column order of the data frame?

I could not find a way to do that within the df.hist() function. But you can accomplish it with the simple loop below:
fig, ax = plt.subplots(1, len(df.columns), figsize=(3*len(df.columns), 3))
for i, var in enumerate(df):
    df[var].hist(ax=ax[i])
    ax[i].set_title(var)
plt.show()

I like @foglerit's answer, but here's another workaround:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'var3':[1,3,9,6,1,6,3,1,1,3],
'var1':[9,1,2,6,6,5,9,3,1,7],
'var2':[6,6,2,9,8,3,5,4,1,3]})
columns = df.columns # save original column names
columns_temp = [] # create temporary column names, numbered
for i, col in enumerate(df.columns):
    columns_temp.append('(' + str(i+1) + ') ' + str(col))
df.columns = columns_temp
df.hist(layout=(1,3)) # now the column order is not messed up
df.columns = columns # reassign original column names

plotting a pandas dataframe row by row

I have the following dataframe:
I want to create one pie chart for each row, but I am having trouble with the order of the charts. I want each chart to have a figsize of, say, (5, 5), and every row of my dataframe to become one plot in my subplots, with the index as its title.
I tried many combinations and played with pyplot.subplots, but without success.
I would be glad for some help.
Thanks

You can either transpose your dataframe and use the pandas pie plot kind, i.e. df.transpose().plot(kind='pie', subplots=True), or iterate through the rows while creating subplots.
An example using subplots:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Recreate a similar dataframe
rows = ['rows {}'.format(i) for i in range(5)]
columns = ['hits', 'misses']
col1 = np.random.random(5)
col2 = 1 - col1
data = list(zip(col1, col2))
df = pd.DataFrame(data=data, index=rows, columns=columns)
# Plotting
fig = plt.figure(figsize=(15,10))
for i, (name, row) in enumerate(df.iterrows()):
    ax = plt.subplot(2,3, i+1)
    ax.set_title(row.name)
    ax.set_aspect('equal')
    ax.pie(row, labels=row.index)
plt.show()
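The transpose variant mentioned at the start of this answer could be sketched as follows. The figsize value and the title handling are assumptions for illustration, and the Agg backend is selected only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit in a notebook
import numpy as np
import pandas as pd

# Recreate a similar dataframe: one row per pie chart.
rows = ["rows {}".format(i) for i in range(5)]
hits = np.random.random(5)
df = pd.DataFrame({"hits": hits, "misses": 1 - hits}, index=rows)

# After transposing, each original row is a column, and pandas draws
# one pie per column when subplots=True.
axes = df.T.plot(kind="pie", subplots=True, figsize=(25, 5), legend=False)
for ax, name in zip(np.ravel(axes), df.index):
    ax.set_title(name)  # use the row label as the title
    ax.set_ylabel("")   # drop the default y label
```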
