FacetGrid plot with aggregate in Seaborn/other library - pandas

I have a toy dataframe like this:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({'cat': ['a', 'a', 'a', 'b', 'b', 'b'], 'n1': [1,1,1,4,5,6], 'n2': [6,5,2,2,2,1]})
I want to group by cat and plot histograms for n1 and n2. Additionally, I want to plot those histograms without grouping. So first, I transform the data to the long format seaborn expects:
df2 = pd.melt(df, id_vars='cat', value_vars=['n1', 'n2'], value_name='value')
Second, I add an "all" category:
df_all = df2.copy()
df_all['cat'] = 'all'
df3 = pd.concat([df2, df_all])
Finally, I plot (using df3, so that the "all" row is included):
g = sns.FacetGrid(df3, col="variable", row="cat")
g.map(plt.hist, 'value', ec="k")
I wonder whether this could be done in a more elegant, concise way, without creating df3 or df2. A different library could be used.

As I mentioned in my comment, I think what you do is perfectly fine. Wrap it in a function if you need to do it often. Nevertheless, you might be interested in pandas_profiling, which describes the profile of your data in detail and in an interactive way. In my opinion, this is probably overkill for what you want to do, but I'll let you be the judge of that ;)
import pandas_profiling
df.profile_report()
Extract of the interactive output: [image not included]
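
Coming back to the suggestion of wrapping the steps in a function, a minimal sketch could look like the following. The helper name melt_with_all and its all_label parameter are made up for illustration; the body only uses the melt and concat steps from the question.

```python
import pandas as pd


def melt_with_all(df, id_col, value_cols, all_label="all"):
    """Return df in long format, with every row repeated under an extra group."""
    long_df = df.melt(id_vars=id_col, value_vars=value_cols, value_name="value")
    # assign() returns a modified copy, so long_df itself is left untouched
    everything = long_df.assign(**{id_col: all_label})
    return pd.concat([long_df, everything], ignore_index=True)


df = pd.DataFrame({"cat": ["a", "a", "a", "b", "b", "b"],
                   "n1": [1, 1, 1, 4, 5, 6],
                   "n2": [6, 5, 2, 2, 2, 1]})
long_df = melt_with_all(df, "cat", ["n1", "n2"])
```

The result can then be passed straight to the plotting call from the question, e.g. sns.FacetGrid(long_df, col="variable", row="cat").map(plt.hist, "value", ec="k"), without keeping df2 and df3 around as separate variables.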

Plotting dataframe with NAs with linearly joined points

I have a dataframe where each column has many missing values. How can I make a plot where the datapoints in each column are joined with lines, i.e. NAs are ignored, instead of having a choppy plot?
import numpy as np
import pandas as pd
pd.options.plotting.backend = "plotly"
d = pd.DataFrame(data = np.random.choice([np.nan] + list(range(7)), size=(10,3)))
d.plot(markers=True)
One way is to use this for each column:
import plotly.graph_objects as go
fig = go.Figure()
# x, y: the non-NaN points of a single column
fig.add_trace(go.Scatter(x=x, y=y, name="linear",
                         line_shape='linear'))
Are there any better ways to accomplish this?
You can use pandas interpolate. I have demonstrated it using Plotly Express with chained calls, so the underlying data is not changed.
Following the comments on this post, I have amended the answer so that markers are not shown for interpolated points.
import numpy as np
import pandas as pd
import plotly.express as px
d = pd.DataFrame(data=np.random.choice([np.nan] + list(range(7)), size=(10, 3)))
px.line(d).update_traces(mode="lines+markers").add_traces(
    px.line(d.interpolate(limit_direction="both")).update_traces(showlegend=False).data
)
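To make it clearer what interpolate contributes here, the following small sketch applies it to a hand-made Series (the values are invented for illustration). With the default linear method, interior gaps are filled along a straight line, and limit_direction="both" also fills gaps at the start and end with the nearest valid value.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, 4.0, np.nan])

# interpolate() returns a new Series; s keeps its NaNs
filled = s.interpolate(limit_direction="both")
print(filled.tolist())  # [2.0, 2.0, 3.0, 4.0, 4.0]
```

Because the original data is untouched, the first px.line call in the answer still draws markers only at the real observations, while the second trace supplies the connecting lines.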

how to assign different markers to the max value found in each column in the plot

How would I assign markers with different symbols to the max value found in each curve? I.e., 4 different markers, each showing the max value of one curve.
Here is my attempt:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
df = pd.DataFrame(np.random.randint(0,1000,size=(100, 4)), columns=list('ABCD'))
maxValues=df.max()
m=['o', '.', ',', 'x',]
df.plot()
plt.plot(maxValues, marker=m)
In my real df, the number of columns will vary.
You can do it this way. Note that I used 'v' instead of ',' because the comma (pixel) marker wasn't showing up clearly.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
df = pd.DataFrame(np.random.randint(0,1000,size=(100, 4)), columns=list('ABCD'))
df.plot(figsize=(20,5))
# Row 0: x position of each maximum, row 1: the maximum value, row 2: marker symbol
mrk = pd.DataFrame({'A': [df['A'].idxmax(), df['A'].max(), 'o'],
                    'B': [df['B'].idxmax(), df['B'].max(), '.'],
                    'C': [df['C'].idxmax(), df['C'].max(), 'v'],
                    'D': [df['D'].idxmax(), df['D'].max(), 'x']})
for col in range(len(mrk.columns)):
    plt.plot(mrk.iloc[0, col], mrk.iloc[1, col], marker=mrk.iloc[2, col], markersize=20)
I created the mrk dataframe manually as it was small, but you can use loops to go through the various columns in your real data. The graph looks like this. Adjust markersize to increase/decrease size of the markers.
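For a varying number of columns, the loop mentioned above could be sketched as follows. This is only one possible version: the marker list and the max_points dictionary are assumptions added for illustration, and itertools.cycle simply reuses the symbols if there are more columns than markers.

```python
import itertools

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 1000, size=(100, 4)), columns=list("ABCD"))

ax = df.plot(figsize=(20, 5))

# Hypothetical marker choice; cycle() repeats it when df has more columns
markers = itertools.cycle(["o", "v", "s", "x", "D", "^"])

max_points = {}
for col, marker in zip(df.columns, markers):
    x_max = df[col].idxmax()   # index label where the column peaks
    y_max = df[col].max()      # the peak value itself
    max_points[col] = (x_max, y_max)
    ax.plot(x_max, y_max, marker=marker, markersize=20, linestyle="none")
```

Calling plt.show() afterwards displays the figure; max_points is kept only so the marked coordinates can be inspected.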

Running df.apply, dask and pd.get_dummies together

I have multiple categorical columns, each with millions of distinct values, so I am using dask and pd.get_dummies to convert these columns into bit vectors, like this:
import pandas as pd
import numpy as np
import scipy.sparse
import dask.dataframe as dd
import multiprocessing
train_set = pd.read_csv('train_set.csv')
def convert_into_one_hot(col1, col2):
    return pd.get_dummies(train_set, columns=[col1, col2], sparse=True)
ddata = dd.from_pandas(train_set, npartitions=2*multiprocessing.cpu_count()).map_partitions(lambda df: df.apply((lambda row: convert_into_one_hot(row.col1, row.col2)), axis=1)).compute(scheduler='processes')
But, I get this error:
ValueError: Metadata inference failed in `lambda`.
You have supplied a custom function and Dask is unable to determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
------------------------
KeyError("None of [Index(['foo'], dtype='object')] are in the [columns]")
What am I doing wrong here? Thanks.
EDIT:
A small example to reproduce the error. Hope it helps to understand the problem.
def convert_into_one_hot(x, y):
    return pd.get_dummies(df, columns=[x, y], sparse=True)
d = {'col1': ['a', 'b'], 'col2': ['c', 'd']}
df = pd.DataFrame(data=d)
dd.from_pandas(df, npartitions=2*multiprocessing.cpu_count()).map_partitions(lambda df: df.apply((lambda row: convert_into_one_hot(row.col1, row.col2)), axis=1)).compute(scheduler='processes')
I think you could run into problems if you try to use get_dummies within partitions. There is a dask version of it, which should work as follows:
import pandas as pd
import dask.dataframe as dd
import multiprocessing as mp
d = {'col1': ['a', 'b'], 'col2': ['c', 'd']}
df = pd.DataFrame(data=d)
Pandas
pd.get_dummies(df, columns=["col1", "col2"], sparse=True)
Dask
ddf = dd.from_pandas(df, npartitions=2 * mp.cpu_count())
# the columns need to be converted to the category dtype first
dummies_cols = ["col1", "col2"]
ddf = ddf.categorize(columns=dummies_cols)
dd.get_dummies(ddf, columns=["col1", "col2"], sparse=True)
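As to why the category dtype matters, a plausible reading is that every partition has to produce the same set of dummy columns, even when it does not contain every value. The sketch below illustrates that idea with plain pandas only; part1 and part2 are made-up stand-ins for two partitions, not something dask creates under these names.

```python
import pandas as pd

# With a shared categorical dtype, each "partition" knows the full set of values
cats = pd.CategoricalDtype(["a", "b", "c"])
part1 = pd.DataFrame({"col1": pd.Series(["a", "b"], dtype=cats)})
part2 = pd.DataFrame({"col1": pd.Series(["c"], dtype=cats)})

cols1 = list(pd.get_dummies(part1, columns=["col1"]).columns)
cols2 = list(pd.get_dummies(part2, columns=["col1"]).columns)
print(cols1)  # ['col1_a', 'col1_b', 'col1_c']
print(cols2)  # ['col1_a', 'col1_b', 'col1_c']

# With plain object columns, the result depends on the values that happen to be present
obj_cols = list(pd.get_dummies(pd.DataFrame({"col1": ["c"]}), columns=["col1"]).columns)
print(obj_cols)  # ['col1_c']
```

With object columns the two pieces would end up with different layouts, which is the kind of mismatch the categorize step avoids.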

Stacked barplot in pandas- read from dataframe?

I am trying to create a stacked barplot using a data frame I have created that looks like this:
I want the stacked bar chart to show the 'types of exploitation' on the x axis, and then the male and female figures stacked on top of each other under these headings.
Is there a way to do this by reading the info from my df? I have read about creating an index to do this, but I do not understand whether that is the solution.
I also need a legend showing 'male' and 'female'.
You can stack bars on top of each other with the bottom parameter of plt.bar in matplotlib.
Step 1: Create dataframe and import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rc
d = {'male': [37,1032,1], 'female': [96,134,1]}
df = pd.DataFrame(data=d, index=['a', 'b', 'c'])
Step 2: Create graph
r = [0,1,2]
bars1 = df['female']
bars2 = df['male']
plt.bar(r, bars1)
plt.bar(r, bars2,bottom=bars1, color='#557f2d')
plt.xticks(r, df.index, fontweight='bold')
plt.legend(labels = ['female', 'male'])
plt.show()
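As an alternative that reads everything from the dataframe, pandas' own plotting wrapper accepts stacked=True. The sketch below assumes the same toy values as above and that the dataframe index holds the category names; the axis labels are placeholders.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"male": [37, 1032, 1], "female": [96, 134, 1]},
                  index=["a", "b", "c"])

# Column order decides stacking order: female at the bottom, male on top
ax = df[["female", "male"]].plot(kind="bar", stacked=True, rot=0)
ax.set_xlabel("type of exploitation")  # placeholder label
ax.set_ylabel("count")                 # placeholder label
```

The legend entries come from the column names and the x tick labels from the index, so no manual bottom= bookkeeping is needed; plt.show() displays the result.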

Matplotlib Bar Graph Yaxis not being set to 0 [duplicate]

My DataFrame's structure
trx.columns
Index(['dest', 'orig', 'timestamp', 'transcode', 'amount'], dtype='object')
I'm trying to plot transcode (transaction code) against amount to see how much money is spent per transaction. I made sure to convert transcode to a categorical type, as seen below.
trx['transcode']
...
Name: transcode, Length: 21893, dtype: category
Categories (3, int64): [1, 17, 99]
The result I get from doing plt.scatter(trx['transcode'], trx['amount']) is:
[Image: scatter plot]
While the above plot is not entirely wrong, I would like the X axis to contain just the three possible values of transcode [1, 17, 99] instead of the entire [1, 100] range.
Thanks!
Since matplotlib 2.1 you can plot categorical variables by using strings, i.e. if you provide the column for the x values as strings, matplotlib will treat them as categories.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({"x" : np.random.choice([1,17,99], size=100),
"y" : np.random.rand(100)*100})
plt.scatter(df["x"].astype(str), df["y"])
plt.margins(x=0.5)
plt.show()
In order to obtain the same in matplotlib <= 2.0, one would plot against an index instead.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({"x" : np.random.choice([1,17,99], size=100),
"y" : np.random.rand(100)*100})
u, inv = np.unique(df["x"], return_inverse=True)
plt.scatter(inv, df["y"])
plt.xticks(range(len(u)),u)
plt.margins(x=0.5)
plt.show()
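To see what np.unique(..., return_inverse=True) contributes in the snippet above, here is a tiny example on hand-picked values (the array x is made up for illustration). u holds the sorted distinct codes, and inv gives, for every original entry, its position in u, which is what gets used as the x coordinate.

```python
import numpy as np

x = np.array([17, 1, 99, 17, 1])
u, inv = np.unique(x, return_inverse=True)

print(u.tolist())    # [1, 17, 99]
print(inv.tolist())  # [1, 0, 2, 1, 0]

# Indexing u with inv reconstructs the original array
assert (u[inv] == x).all()
```

Plotting against inv therefore places the three codes at positions 0, 1 and 2, and plt.xticks(range(len(u)), u) relabels those positions with the actual codes.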
The same plot can be obtained using seaborn's stripplot:
import seaborn as sns
sns.stripplot(x="x", y="y", data=df)
And a potentially nicer representation can be done via seaborn's swarmplot:
sns.swarmplot(x="x", y="y", data=df)