Seaborn hue with loc condition - pandas

I'm facing the following problem: I'd like to create an lmplot with seaborn and distinguish the colors not by an existing column but by a condition applied to a column.
Given the following df for a rental price prediction:
area  rental price  year build  ...
40    400           1990        ...
60    840           1995        ...
480   16            1997        ...
...   ...           ...         ...
sns.lmplot(x="area", y="rental price", data=df, hue = df.loc[df['year build'] > 1992])
The call above does not work. I know I can add a column representing this condition and pass that column to "hue", but is there no way to give seaborn a condition for hue directly?
Thanks in advance!

You could add a new column with the boolean information and use that for the hue. For example data['at least from eighties'] = data['model_year'] >= 80. This will create a legend with the column name as title, and False and True as labels. If you map the values to strings, those strings appear in the legend instead. Here is an example using one of seaborn's demo datasets:
import matplotlib.pyplot as plt
import seaborn as sns
df = sns.load_dataset('mpg')
df['decenium'] = (df['model_year'] >= 80).map({False: "seventies", True: "eighties"})
sns.lmplot(x='weight', y='mpg', data=df, hue='decenium')
plt.tight_layout()
plt.show()
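If the goal is to avoid permanently adding a column, one option is to build the helper column on the fly with DataFrame.assign, which returns a copy and leaves df untouched. A minimal sketch, assuming a made-up rental dataset (the values and the column name "built after 1992" are illustrative, not taken from the question):

```python
import pandas as pd
import seaborn as sns

# Hypothetical rental data; the numbers are made up for illustration.
df = pd.DataFrame({
    "area": [40, 60, 55, 80, 35, 70, 90, 50],
    "rental price": [400, 840, 600, 950, 380, 900, 1200, 560],
    "year build": [1990, 1995, 1988, 1997, 1985, 2001, 1999, 1991],
})

# assign() returns a new frame with the extra boolean column; df is unchanged.
plot_df = df.assign(**{"built after 1992": df["year build"] > 1992})

# ci=None skips the bootstrap confidence band, which is noisy for tiny samples.
grid = sns.lmplot(data=plot_df, x="area", y="rental price",
                  hue="built after 1992", ci=None)
```

As far as I know, lmplot expects hue to be a column name in data, so the condition still ends up as a column, just in a temporary frame. Call plt.show() from matplotlib.pyplot to display the figure.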

Related

Iterating and ploting five columns per iteration pandas

I am trying to plot five columns per iteration, but my current code plots everything five times. How can I make it plot five columns per iteration without repeating them?
n=4
for tag_1,tag_2,tag_3,tag_4,tag_5 in zip(df.columns[n:], df.columns[n+1:], df.columns[n+2:], df.columns[n+3:], df.columns[n+4:]):
    fig,ax=plt.subplots(ncols=5, tight_layout=True, sharey=True, figsize=(20,3))
    sns.scatterplot(df, x=tag_1, y='variable', ax=ax[0])
    sns.scatterplot(df, x=tag_2, y='variable', ax=ax[1])
    sns.scatterplot(df, x=tag_3, y='variable', ax=ax[2])
    sns.scatterplot(df, x=tag_4, y='variable', ax=ax[3])
    sns.scatterplot(df, x=tag_5, y='variable', ax=ax[4])
    plt.show()
You are using list slicing in the wrong way. When you use df.columns[n:], you get all the column names from the one at index n to the last one. The same holds for n+1, n+2, n+3 and n+4. This causes the repetition you are referring to. In addition, the number of figures shown is determined by the behavior of the zip function: when used on iterables of different lengths, the iterable returned by zip has the length of the shortest one (in this case df.columns[n+4:]).
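The effect of zipping shifted slices is easy to see on a plain list. A small sketch, using a list of letters as a stand-in for df.columns (the names windows and groups are only illustrative):

```python
columns = list("abcdefgh")  # stand-in for df.columns
n = 0

# Overlapping slices: each tuple is a window shifted by one column.
windows = list(zip(columns[n:], columns[n + 1:], columns[n + 2:],
                   columns[n + 3:], columns[n + 4:]))
for window in windows:
    print(window)

# Non-overlapping groups of at most five columns.
groups = [columns[start:start + 5] for start in range(0, len(columns), 5)]
print(groups)
```

With eight columns, zip yields four overlapping windows (stopping when the shortest slice runs out), while stepping the start index by five yields two disjoint groups.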
You can achieve what you want by adapting your code as follows:
# Imports to create sample data
import string
import random
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Create some sample data and a sample dataframe
data = { string.ascii_lowercase[i]: [random.randint(0, 100) for _ in range(100)] for i in range(15) }
df = pd.DataFrame(data)
# Iterate in groups of five indexes
for start in range(0, len(df.columns), 5):
    # Get the next five columns. Pay attention to the case in which the number of columns is not a multiple of 5
    cols = [df.columns[idx] for idx in range(start, min(start+5, len(df.columns)))]
    # Adapt your plot and take into account that the last group can be smaller than 5
    fig,ax=plt.subplots(ncols=len(cols), tight_layout=True, sharey=True, figsize=(20,3))
    for idx in range(len(cols)):
        #sns.scatterplot(df, x=cols[idx], y='variable', ax=ax[idx])
        sns.scatterplot(df, x=cols[idx], y=df[cols[idx]], ax=ax[idx]) # In the example the values of the column are plotted
    plt.show()
In this case, the code performs the following steps:
Iterate over groups of at most five indexes ([0->4], [5->9]...)
Recover the columns positioned at those indexes. The last group of columns may be smaller than 5 (e.g., with 18 columns, the last group is composed of the ones with indexes 15, 16, 17)
Create the plot, taking into account the corner case of fewer than 5 columns
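One corner case worth guarding against: when the last group holds a single column, plt.subplots returns a lone Axes instead of an array, so indexing it fails. A hedged sketch of the same steps, assuming a hypothetical helper column_groups and plain matplotlib scatter swapped in for sns.scatterplot to keep it short:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def column_groups(columns, size=5):
    """Yield consecutive, non-overlapping groups of at most `size` columns."""
    for start in range(0, len(columns), size):
        yield list(columns[start:start + size])


# Hypothetical frame with 11 columns, so the last group holds a single column.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(20, 11)), columns=list("abcdefghijk"))

group_sizes = []
for cols in column_groups(df.columns):
    # squeeze=False always returns a 2-D array of axes, even for one column.
    fig, axes = plt.subplots(ncols=len(cols), squeeze=False, sharey=True,
                             figsize=(4 * len(cols), 3), tight_layout=True)
    for ax, col in zip(axes[0], cols):
        ax.scatter(df[col], df.index)
        ax.set_xlabel(col)
    group_sizes.append(len(cols))
    plt.close(fig)  # replace with plt.show() to display each figure
```

Passing squeeze=False means axes[0] is always iterable, so the inner loop needs no special case for a short final group.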
With Seaborn's objects interface, available from v0.12, we could do it like this:
from numpy import random
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import seaborn.objects as so
sns.set_theme()
First, let's create a sample dataset, just like trolloldem's answer.
random.seed(0) # To produce the same random values across multiple runs
columns = list("abcdefghij")
sample_size = 20
df_orig = pd.DataFrame(
    {c: random.randint(100, size=sample_size) for c in columns},
    index=pd.Series(range(sample_size), name="variable")
)
Then transform the data frame into a long-form for easier processing.
df = (df_orig
    .melt(value_vars=columns, var_name="tag", ignore_index=False)
    .reset_index()
)
Then finally render the figures, 5 figures per row.
(
    so.Plot(df, x="value", y="variable") # Or you might do x="variable", y="value" instead
    .facet(col="tag", wrap=5)
    .add(so.Dot())
)
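For seaborn versions without the objects interface, the same long-form frame can probably be passed to relplot with col_wrap. A minimal sketch, assuming made-up random data and the illustrative name long_df:

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
columns = list("abcdefg")
sample_size = 10

# Wide frame: one column per tag, index named "variable".
df_orig = pd.DataFrame(
    {c: rng.integers(0, 100, size=sample_size) for c in columns},
    index=pd.Series(range(sample_size), name="variable"),
)

# Long form: one row per (variable, tag) pair with its value.
long_df = (
    df_orig
    .melt(value_vars=columns, var_name="tag", ignore_index=False)
    .reset_index()
)

# One facet per tag, wrapping to a new row after five facets.
grid = sns.relplot(data=long_df, x="value", y="variable",
                   col="tag", col_wrap=5, height=2.5)
```

The melted frame has the columns variable, tag and value, which is what lets a single faceting call replace the explicit loop over column groups.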

How to plot in pandas after groupby function

import pandas as pd
df = pd.read_excel(some data)
df2 = df.groupby(['Country', "Year"]).sum()
It looks like this:
               Sales  COGS  Profit  Month Number
Country  Year
Canada   2013  3000
Canada   2014  3500
Other countries... other data
df3 = df2[[' Sales']]
I can plot it like this with the code:
df3.plot(kind="bar")
And it produces a chart
I want to turn it into a line chart instead, but my result from a simple plot is:
I'm stuck on which one-liner will produce a chart with time on the x-axis and sales on the y-axis, with one line per country.
You have to unstack the Country level, so that each country becomes its own column:
import matplotlib.pyplot as plt
df2 = df.groupby(['Country', 'Year'])['Sales'].sum().unstack('Country')
# Or df2.plot(title='Sales').set_xticks(df2.index)
ax = df2.plot(title='Sales')
ax.set_xticks(df2.index)
plt.show()
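To make the reshaping concrete, here is a small sketch on made-up sales records (the figures and the names totals and wide are illustrative). It shows the intermediate Series and the wide frame that DataFrame.plot turns into one line per country:

```python
import pandas as pd

# Made-up sales records; several rows per (Country, Year) pair.
df = pd.DataFrame({
    "Country": ["Canada", "Canada", "Canada", "Germany", "Germany", "Germany"],
    "Year": [2013, 2013, 2014, 2013, 2014, 2014],
    "Sales": [1000, 2000, 3500, 1500, 700, 800],
})

# Series indexed by (Country, Year).
totals = df.groupby(["Country", "Year"])["Sales"].sum()

# Move Country from the index into the columns: rows are years, columns are countries.
wide = totals.unstack("Country")
print(wide)

# pandas draws one line per column and uses the index (Year) as the x-axis.
ax = wide.plot(title="Sales")
ax.set_xticks(wide.index)
```

Setting the ticks to wide.index keeps the years as whole numbers instead of letting matplotlib pick fractional tick positions.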
Output:

Creating a barplot in python seaborn with error bars showing standard deviation

I am new to python.
I am analyzing a dataset and need some help in plotting the barplot with error bars showing SD.
Check an example data set below at the following link https://drive.google.com/file/d/10JDr7d_vhEocWzChg-sfBEumsWVghFS8/view?usp=sharing
Here is the code that I am using;
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
df = pd.read_excel('Sample_data.xlsx')
#Adding a column 'Total' by adding all cell counts in each row
#This will give the cells counted in each sample
df['Total'] = df['Cell1'] + df['Cell2'] + df['Cell3'] + df['Cell4']
df
# Creating a pivot table based on Timepoint and cell types
phenotype = df.pivot_table(index = ['Timepoint'],
                           values=['Cell1',
                                   'Cell2',
                                   'Cell3',
                                   'Cell4'],
                           aggfunc = np.sum,
                           margins = False)
phenotype
# plot different cell types grouped according to the timepoint and error bars = SD
sns.barplot(data = phenotype)
Now I am stuck in plotting cell types based on timepoint column and putting error bars = SD.
Any help is much appreciated.
Thanks.
If you swap the rows and columns from pivot, you get the format you want. Does this fit the intent of your question?
phenotype = df.pivot_table(index = ['Time point'],
                           values=['Cell1', 'Cell2', 'Cell3', 'Cell4'],
                           aggfunc = np.sum,
                           margins = False)
phenotype.reset_index()
phenotype = phenotype.stack().unstack(level=0)
phenotype
Time point   48   72   96
Cell1        54  395   57
Cell2        33   35   39
Cell3         1    3    9
Cell4         2    6    3
sns.boxplot(data = phenotype)
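Note that error bars need the per-sample variation, which a sum-based pivot discards. One possible approach, sketched here on made-up replicate counts (not the linked dataset), is to compute the mean and standard deviation per timepoint with groupby and pass the deviations to pandas' bar plot through yerr:

```python
import pandas as pd

# Made-up replicate counts: two samples per timepoint.
df = pd.DataFrame({
    "Timepoint": [48, 48, 72, 72, 96, 96],
    "Cell1": [20, 34, 190, 205, 25, 32],
    "Cell2": [15, 18, 17, 18, 19, 20],
})

grouped = df.groupby("Timepoint")[["Cell1", "Cell2"]]
means = grouped.mean()
stds = grouped.std()  # sample standard deviation (ddof=1)

# Grouped bars per timepoint, one bar per cell type, error bars = SD.
ax = means.plot(kind="bar", yerr=stds, capsize=4)
ax.set_ylabel("cells per sample")
```

In seaborn 0.12 and later, calling sns.barplot on long-form data with errorbar="sd" should give a similar figure; the pandas route is used here because it does not depend on the seaborn version.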

Replace xticks with names

I am working on the Spotify dataset from Kaggle. I plotted a barplot showing the top artists with most songs in the dataframe.
But the X-axis is showing numbers and I want to show names of the Artists.
names = list(df1['artist'][0:19])
plt.figure(figsize=(8,4))
plt.xlabel("Artists")
sns.barplot(x=np.arange(1,20),
y=df1['song_title'][0:19]);
I tried both a list and a Series, but both give an error.
How can I replace the numbers in the xticks with names?
Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Data
Data from Spotify - All Time Top 2000s Mega Dataset
df = pd.read_csv('Spotify-2000.csv')
titles = pd.DataFrame(df.groupby(['Artist'])['Title'].count()).reset_index().sort_values(['Title'], ascending=False).reset_index(drop=True)
titles.rename(columns={'Title': 'Title Count'}, inplace=True)
# titles.head()
Artist              Title Count
Queen               37
The Beatles         36
Coldplay            27
U2                  26
The Rolling Stones  24
Plot
plt.figure(figsize=(8, 4))
chart = sns.barplot(x=titles.Artist[0:19], y=titles['Title Count'][0:19])
chart.set_xticklabels(chart.get_xticklabels(), rotation=90)
plt.show()
OK, so I didn't know this, although in hindsight it seems obvious!
Pass the names (or string labels) as the argument for the x-axis.
Use plt.xticks(rotation=90) so the labels don't overlap.
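Put together, that might look like the sketch below. It reuses the five artist counts from the table in the other answer rather than the full CSV:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# First rows of the artist counts table from the answer above.
titles = pd.DataFrame({
    "Artist": ["Queen", "The Beatles", "Coldplay", "U2", "The Rolling Stones"],
    "Title Count": [37, 36, 27, 26, 24],
})

fig, ax = plt.subplots(figsize=(8, 4))
# String values on x become the tick labels directly.
sns.barplot(data=titles, x="Artist", y="Title Count", ax=ax)
# Rotate the labels on the current axes so long names do not overlap.
plt.xticks(rotation=90)
fig.tight_layout()
```

Call plt.show() afterwards to display the chart; tight_layout leaves room for the rotated names at the bottom.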

Pandas Stacked Bar Plot - Columns by Max Value, Not Summed

%matplotlib inline
import matplotlib
matplotlib.style.use('ggplot')
import numpy as np
import pandas as pd
my_data = np.array([[ 0.110622 , 0.98174432, 0.56583323],
[ 0.61825694, 0.14166864, 0.44180003],
[ 0.02572145, 0.55764373, 0.24183103],
[ 0.98040318, 0.76171712, 0.41994361],
[ 0.49859658, 0.76637672, 0.75487683]])
pd.DataFrame(my_data).plot(kind='bar', stacked=True)
Using the above code I get:
How do I change this so that the height of every bar is the max value for that bar instead of the sum, and so all the lower values for the bar are in the same bar as different colors?
Thanks for your help.
If I understood your question correctly, I would normalize your data by multiplying each value by its row's maximum and dividing by its row's sum, so that each stacked bar's total height equals the row maximum:
df = df.apply(lambda x: x*df.max(axis=1)/df.sum(axis=1))
where:
df = pd.DataFrame(my_data)
The new plot is:
df.plot(kind='bar', stacked=True)
Hope that helps.
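A quick way to check the normalization is to confirm that each rescaled row sums to the original row maximum. A small sketch on the first two rows of my_data, written with DataFrame.mul (which should be equivalent to the apply call above); the names row_max, row_sum and scaled are illustrative:

```python
import numpy as np
import pandas as pd

my_data = np.array([[0.110622, 0.98174432, 0.56583323],
                    [0.61825694, 0.14166864, 0.44180003]])
df = pd.DataFrame(my_data)

row_max = df.max(axis=1)
row_sum = df.sum(axis=1)

# Scale every row by max/sum so the stacked segments add up to the row maximum.
scaled = df.mul(row_max / row_sum, axis=0)

ax = scaled.plot(kind="bar", stacked=True)
```

The segments keep their relative proportions within a bar, but their individual heights no longer equal the original values, which may or may not be acceptable depending on what the chart is meant to show.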