Pandas Dataframe Create Seaborn Horizontal Barplot with categorical data - pandas

I'm currently working with a data frame like this:
What I want is to show the total number of rows in the VICTORY column where the value is S, grouped by AGE_GROUP and split by GENDER, something like in the following horizontal barplot:
So far I have obtained the following chart:
Following these steps:
victory_df = main_df[main_df["VICTORY"] == "S"]
victory_count = victory_df["AGE_GROUP"].value_counts()
sns.set(style="darkgrid")
sns.barplot(x=victory_count.index, y=victory_count.values, alpha=0.9)
Which strategy should I use to split the value counts by gender and include that in the chart?

It would obviously help to provide raw data rather than an image. I came up with my own data. I'm not sure I understood your question, but my attempt is below.
Data
df=pd.DataFrame.from_dict({'VICTORY':['S', 'S', 'N', 'N', 'N', 'S', 'N', 'S', 'N', 'S', 'N', 'S', 'S'],'AGE':[5., 88., 12., 19., 30., 43., 77., 50., 78., 34., 45., 9., 67.],'AGE_GROUP':['0-13', '65+', '0-13', '18-35', '18-35', '36-64', '65+', '36-64','65+', '18-35', '36-64', '0-13', '65+'],'GENDER':['M', 'M', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'F']})
Plotting. I group by AGE_GROUP, count the values of GENDER, unstack, and plot a stacked horizontal bar plot. Seaborn is built on matplotlib, and when a plot is not straightforward in seaborn, like the stacked horizontal bar, I fall back to matplotlib. I hope you don't take offence.
import matplotlib.pyplot as plt
df[df['VICTORY']=='S'].groupby('AGE_GROUP')['GENDER'].apply(lambda x: x.value_counts()).unstack().plot(kind='barh', stacked=True)
plt.xlabel('Count')
plt.title('xxxx')
Output
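For what it's worth, the same count table can probably also be built with pd.crosstab, which fills missing combinations with 0 instead of NaN. A minimal sketch on the made-up data above (the names victory_df and counts are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'VICTORY': ['S', 'S', 'N', 'N', 'N', 'S', 'N', 'S', 'N', 'S', 'N', 'S', 'S'],
    'AGE_GROUP': ['0-13', '65+', '0-13', '18-35', '18-35', '36-64', '65+',
                  '36-64', '65+', '18-35', '36-64', '0-13', '65+'],
    'GENDER': ['M', 'M', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'F'],
})

# Keep only the rows where VICTORY is 'S'.
victory_df = df[df['VICTORY'] == 'S']

# Rows are age groups, columns are genders, cells are row counts
# (0 where an age group / gender combination does not occur).
counts = pd.crosstab(victory_df['AGE_GROUP'], victory_df['GENDER'])
print(counts)
```

Calling counts.plot(kind='barh', stacked=True) on this table should give the same kind of stacked chart; leaving out stacked=True draws the genders side by side.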

Related

how do I set the max values for a plotly px.sunburst graph?

I am trying to show how much a student has completed from a set of challenges with a plotly sunburst graph. I want the 'category' maximum value to be shown for each of them, but only fill in the challenges that they've done. I was thinking of having the ones they did not do greyed out. I have the max values for each of the challenges in the dataframe 'challenge_count_df' and the student's work in 'student_df':
import pandas as pd
import plotly.express as px
challenge_count_df = pd.DataFrame({'Challenge': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
'Value' : ["5","5","10","15","5","10","5","10","15","10"],
'Category' : ['linux','primer','windows','linux','linux','primer','windows','linux', 'linux', 'primer']})
student_df = pd.DataFrame({'Challenge': ['B', 'C', 'E', 'F', 'G', 'H', 'I'],
'Value' : ["5","10","5","10","5","10","15"],
'Category' : ['primer','windows','linux','primer','windows','linux', 'linux']})
As you can see, the student_df has some of the challenges missing. That's because they didn't answer them.
I know how to start a sunburst like this:
fig = px.sunburst(challenge_count_df, path=['Category','Challenge'],values='Value')
Is there a way to overlap that with this?
fig = px.sunburst(student_df, path=['Category','Challenge'],values='Value')
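One possible starting point, offered as a sketch rather than a confirmed answer: instead of overlapping two figures, keep the full challenge list and add a column that marks which challenges the student completed. The names progress_df and Status below are hypothetical helpers, not anything from plotly.

```python
import pandas as pd

challenge_count_df = pd.DataFrame({
    'Challenge': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'Value': ["5", "5", "10", "15", "5", "10", "5", "10", "15", "10"],
    'Category': ['linux', 'primer', 'windows', 'linux', 'linux',
                 'primer', 'windows', 'linux', 'linux', 'primer'],
})
student_df = pd.DataFrame({
    'Challenge': ['B', 'C', 'E', 'F', 'G', 'H', 'I'],
    'Value': ["5", "10", "5", "10", "5", "10", "15"],
    'Category': ['primer', 'windows', 'linux', 'primer', 'windows', 'linux', 'linux'],
})

# Start from the full challenge list so every sector keeps its maximum value.
progress_df = challenge_count_df.copy()
progress_df['Value'] = pd.to_numeric(progress_df['Value'])  # values were stored as strings

# 'Status' (hypothetical column name): did the student complete this challenge?
done = progress_df['Challenge'].isin(student_df['Challenge'])
progress_df['Status'] = done.map({True: 'done', False: 'not done'})

print(progress_df)
```

An untested possibility is to pass progress_df to px.sunburst with path=['Category', 'Challenge'], values='Value', color='Status', and a color_discrete_map that assigns grey to 'not done'. How the parent sectors get colored when their children differ would need checking.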

How can I print it out in this order: table, bar chart, table ...?

How can I print it out in this order: table, bar chart, table, bar chart, ...?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(100, 10),
columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
for column in df:
    print(df[column].value_counts(normalize=True, bins=10))
    print(df[column].hist(bins=10))
It prints all the tables first, then one joint bar chart. I want to alternate tables and bar charts instead.
What do you mean by tables? Are you doing plt.show() to get your plots?
import matplotlib.pyplot as plt
for column in df:
    print(df[column].value_counts(normalize=True, bins=10))
    print(df[column].hist(bins=10))
    plt.show()
This shows me the value_counts output followed by each individual plot. If you call plt.show() outside of the loop, the plots just accumulate on the same axes unless you clear them.

Pyspark equivalent for groupby and aggregation

I have a PySpark dataframe and I am trying to perform a groupby and aggregation on it.
I am performing the following operations in Pandas and it's working fine:
new_df = new_df.groupBy('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'H', 'K', 'L', 'Cost1','Cost2','Cost3','Cost4','Cost5')
new_df = new_df.agg({'Cost1':sum, 'Cost2':sum, 'Cost3':sum,'Cost4':sum, 'Cost5':sum})
But I am unable to perform the same operations in PySpark using the syntax below:
new_df = new_df.groupBy('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'H', 'K', 'L', 'Cost1','Cost2','Cost3','Cost4','Cost5').agg(F.sum(ost1','Cost2','Cost3','Cost4','Cost5'))
Error:
AttributeError: 'GroupedData' object has no attribute 'groupBy'
You have a typo here: (ost1', where the 'C is missing. The error, however, relates to another problem in your code. You probably call groupBy() twice, like this: groupBy("A").groupBy("B"). You cannot do this. You should call one of the aggregation functions on the GroupedData object after groupBy(). I think you need this code:
new_df = df.groupBy("A", "B").sum("Cost1", "Cost2")
new_df.show()
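For comparison, the pandas side of the same operation might look like the sketch below. The small frame and its values are made up for illustration; only the column names come from the question. Note that the cost columns are summed, not used as grouping keys.

```python
import pandas as pd

# Made-up sample data; the real frame has more key and cost columns.
df = pd.DataFrame({
    'A': ['x', 'x', 'y'],
    'B': ['p', 'p', 'q'],
    'Cost1': [1.0, 2.0, 3.0],
    'Cost2': [10.0, 20.0, 30.0],
})

# Group by the key columns only, then sum each cost column within a group.
totals = df.groupby(['A', 'B'], as_index=False)[['Cost1', 'Cost2']].sum()
print(totals)
```

If the cost columns were also listed as grouping keys, as in the question's groupBy call, each distinct cost value would likely end up in its own group and the sums would not collapse the rows.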

How can I change the filled color of stacked area plot in DataFrame?

I want to change the fill color in stacked area plots drawn with pandas.DataFrame.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
ax = df.plot.area(linewidth=0)
The area plot example
Now I guess that the instance returned by the plot function offers access for modifying attributes like colors.
But the axes classes are too complicated to learn quickly, and I failed to find similar questions on Stack Overflow.
So can any master do me a favor?
Use 'colormap' (see the documentation for more details):
ax = df.plot.area(linewidth=0, colormap="Pastel1")
The trick is using the 'color' parameter:
Soln 1: dict
Simply pass a dict of {column name: color}
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'], )
ax = df.plot.area(color={'b':'0', 'c':'#17A589', 'a':'#9C640C', 'd':'#ECF0F1'})
Soln 2: sequence
Simply pass a sequence of color codes (it will match the order of your columns).
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'], )
ax = df.plot.area(color=('0', '#17A589', '#9C640C', '#ECF0F1'))
There is no need to set linewidth (the line colors adjust automatically). Also, this doesn't mess with the legend.
The API of matplotlib is really complex, but the artist module documentation gives a very plain illustration. For bar/barh plots, the attributes can be accessed and modified through .patches, but for the area plot they need to be accessed through .collections.
To achieve the specific modification, use code like this.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
ax = df.plot.area(linewidth=0)
for collection in ax.collections:
    collection.set_facecolor('#888888')
highlight = 0
ax.collections[highlight].set_facecolor('#aa3333')
Other methods of the collections can be found by running
dir(ax.collections[highlight])

imshow: labels as any arbitrary function of the image indices

imshow plots a matrix against its column indices (x axis) and row indices (y axis). I would like the axes labels to not be indices, but an arbitrary function of the indices.
e.g. pitch detection
imshow(A, aspect='auto') where A.shape == (88200,8)
on the x-axis, it shows several ticks at about [11000, 22000, ..., 88000]
on the y-axis, it shows the frequency bins [0,1,2,3,4,5,6,7]
What I want is:
x-axis labels normalized from samples to seconds. For 2 seconds of audio at a 44.1 kHz sample rate, I want two ticks at [1,2].
y-axis labels showing the pitch as a note, i.e. I want the labels to be the notes ['c', 'd', 'e', 'f', 'g', 'a', 'b'].
ideally:
imshow(A, ylabel=lambda i: freqs[i], xlabel=lambda j: j/44100)
You can do this with a combination of Locators and Formatters (doc).
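import matplotlib.pyplot as plt
from matplotlib.ticker import FixedFormatter, FuncFormatter, LinearLocator
from numpy.random import rand
ax = plt.gca()
ax.imshow(rand(500, 500))
# Show the x tick values as seconds by dividing the sample index by the sample rate.
ax.get_xaxis().set_major_formatter(FuncFormatter(lambda x, p: "%.2f" % (x / 44100)))
# Place 7 evenly spaced y ticks and label them with note names.
ax.get_yaxis().set_major_locator(LinearLocator(7))
ax.get_yaxis().set_major_formatter(FixedFormatter(['c', 'd', 'e', 'f', 'g', 'a', 'b']))
plt.draw()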
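A variant closer to the pitch-detection case could look like the sketch below. It assumes the matrix is laid out with one row per note and one column per sample (7 rows here, to match the 7 note names), and it pins the y ticks to the row indices so each label lines up with its row. The names SAMPLE_RATE, notes and seconds_formatter are made up for the example.

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import FixedFormatter, FixedLocator, FuncFormatter, MultipleLocator

SAMPLE_RATE = 44100  # samples per second (assumed, as in the question)
notes = ['c', 'd', 'e', 'f', 'g', 'a', 'b']

# Assumed layout: one row per note, one column per sample (2 seconds of audio).
A = np.random.rand(len(notes), 2 * SAMPLE_RATE)

fig, ax = plt.subplots()
ax.imshow(A, aspect='auto', interpolation='nearest')

# x axis: one tick per whole second, labelled in seconds instead of samples.
seconds_formatter = FuncFormatter(lambda x, pos: f"{x / SAMPLE_RATE:g}")
ax.xaxis.set_major_locator(MultipleLocator(SAMPLE_RATE))
ax.xaxis.set_major_formatter(seconds_formatter)

# y axis: one tick per row index, labelled with the note name for that row.
ax.yaxis.set_major_locator(FixedLocator(list(range(len(notes)))))
ax.yaxis.set_major_formatter(FixedFormatter(notes))
```

Call plt.show() afterwards to display the figure. The FuncFormatter callback receives the tick position in data coordinates, so any function of the index can be used in place of the division.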