I am trying to create a bar plot using pandas. I have the following code:
import pandas as pd
indexes = ['Strongly agree', 'Agree', 'Neutral', 'Disagree', 'Strongly disagree']
df = pd.DataFrame({'Q7': [10, 11, 1, 0, 0]}, index=indexes)
df.plot.bar(indexes, df['Q7'].values)
By my reckoning this should work but I get a weird KeyError: 'Strongly agree' thrown at me. I can't figure out why this won't work.
By invoking plot as a Pandas method, you're referring to the data structures of df to make your plot.
The way you have it set up, with index=indexes, your bar plot's x values are stored in df.index. That's why Wen's suggestion in the comments to just use df.plot.bar() will work, as Pandas automatically looks to use df.index as the x-axis in this case.
Alternately, you can specify column names for x and y. In this case, you can move indexes into a column with reset_index() and then call the new index column explicitly:
df.reset_index().plot.bar(x="index", y="Q7")
Either approach will yield the correct plot:
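As a quick check of the two approaches above, here is a minimal sketch built on the question's data. The Agg backend and the variable names ax1, ax2, labels1 and labels2 are assumptions added for illustration, not part of the original answer:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs headless
import pandas as pd

indexes = ['Strongly agree', 'Agree', 'Neutral', 'Disagree', 'Strongly disagree']
df = pd.DataFrame({'Q7': [10, 11, 1, 0, 0]}, index=indexes)

# Approach 1: let pandas take the x values from df.index
ax1 = df.plot.bar()

# Approach 2: move the index into a column and name x and y explicitly
ax2 = df.reset_index().plot.bar(x="index", y="Q7")

# Both axes should carry the same category labels on the x-axis
labels1 = [t.get_text() for t in ax1.get_xticklabels()]
labels2 = [t.get_text() for t in ax2.get_xticklabels()]
```

The original call failed because df.plot.bar(indexes, ...) treats its first positional argument as column names to look up in df, and 'Strongly agree' is an index label, not a column.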
Is there a parameter to force horizontal tick labels in an mplstyle file and/or via rcParams?
I'm currently using ax.xaxis.set_tick_params(rotation=0) at plot construction. I'd like a permanent style or setting. Thanks!
Default look (with x_compat=True in a pandas DataFrame plot):
Desired look:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'Date': {0: '1950-01-01', 1: '1960-01-02', 2: '1970-01-03', 3: '1980-01-04', 4: '1990-01-05'}, 'Value': {0 : 0, 1: 1, 2: 0, 3: 1, 4: 0}})
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
df = df.set_index('Date', drop=False)
f, ax = plt.subplots()
df.plot(ax=ax, x='Date', x_compat=True)
#ax.xaxis.set_tick_params(rotation=0)
plt.show()
I looked in these pages, but may have missed it:
customizing-with-matplotlibrc-files
matplotlib_configuration_api.html
Use the rot parameter of df.plot:
df.plot(ax=ax, x='Date', x_compat=True, rot=0)
I'll answer my own question to put the matter to rest.
No, there isn't.
[as of January 2022] There is no way to control tick label rotation via a style. This is because the pandas plot wrapper resets the rotation parameter. To quote from pandas/doc/source/user_guide/visualization.rst,
pandas includes automatic tick resolution adjustment for regular
frequency time-series data. For limited cases where pandas cannot
infer the frequency information (e.g., in an externally created
twinx), you can choose to suppress this behavior for alignment
purposes.
[...]
Using the x_compat parameter, you can suppress this behavior
Despite the wording here (namely "alignment purposes"), setting x_compat=True does not reset the rotation parameter back to its matplotlib default of 0, as I'd incorrectly expected.
There seem to be mainly two ways around this:
Use matplotlib directly without pandas.
Reset the rotation inside the pandas plot call. This may be done the pandas way [see Vishnudev's answer] with df.plot(..., rot=0, ...) or the matplotlib way [see my OP] with an axis object, setting ax.xaxis.set_tick_params(rotation=0).
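The second workaround can be sketched in both variants as follows. This reuses the question's data in a shorter form; the Agg backend and the names f1, ax1, f2, ax2, rotations_pandas and rotations_mpl are assumptions introduced for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['1950-01-01', '1960-01-02', '1970-01-03',
                            '1980-01-04', '1990-01-05']),
    'Value': [0, 1, 0, 1, 0],
})

# pandas way: pass rot=0 to the plot call
f1, ax1 = plt.subplots()
df.plot(ax=ax1, x='Date', x_compat=True, rot=0)

# matplotlib way: reset the rotation on the axis after pandas has plotted
f2, ax2 = plt.subplots()
df.plot(ax=ax2, x='Date', x_compat=True)
ax2.xaxis.set_tick_params(rotation=0)

# Collect the resulting tick label rotations for inspection
rotations_pandas = [label.get_rotation() for label in ax1.get_xticklabels()]
rotations_mpl = [label.get_rotation() for label in ax2.get_xticklabels()]
```

Either variant has to be repeated for every plot, which is exactly the limitation described above: neither can be stored in a style file.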
Source and Thanks to: Jody Klymak in comments and Marco Gorelli at Github.
When using seaborn, is there a way I can include multiple variables (columns) for the hue parameter? Another way to ask this question would be how can I group my data by multiple variables before plotting them on a single x,y axis plot?
I want to do something like the call below. However, I am currently not able to specify two variables for the hue parameter:
sns.relplot(x='#', y='Attack', hue=['Legendary', 'Stage'], data=df)
For example, assume I have a pandas DataFrame like the one below, containing a Pokemon database obtained via this tutorial.
I want to plot the pokedex # on the x-axis and the Attack on the y-axis. However, I want the data to be grouped by both Stage and Legendary. Using matplotlib, I wrote a custom function that groups the dataframe by ['Legendary','Stage'] and then iterates through each group for the plotting (see results below). Although my custom function works as intended, I was hoping this could be achieved simply with seaborn. I am guessing there must be other people who have attempted to visualize more than 3 variables in a single plot using seaborn?
fig, ax = plt.subplots()
grouping_variables = ['Stage','Legendary']
group_1 = df.groupby(grouping_variables)
for group_1_label, group_1_df in group_1:
    ax.scatter(group_1_df['#'], group_1_df['Attack'], label=group_1_label)
ax_legend = ax.legend(title=grouping_variables)
Edit 1:
Note: In the example I provided, I grouped the data by only two variables (e.g. Legendary and Stage). However, other situations may require an arbitrary number of variables (e.g. 5 variables).
You can leverage the fact that hue accepts either a column name, or a sequence of the same length as your data, listing the color categories to assign each data point to. So...
sns.relplot(x='#', y='Attack', hue='Stage', data=df)
... is basically the same as:
sns.relplot(x='#', y='Attack', hue=df['Stage'], data=df)
You typically wouldn't use the latter, since it's just more typing to achieve the same thing, unless you want to construct a custom sequence on the fly:
sns.relplot(x='#', y='Attack', data=df,
            hue=df[['Legendary', 'Stage']].apply(tuple, axis=1))
How you build the sequence that you pass via hue is entirely up to you. The only requirements are that it has the same length as your data and, if it is array-like, that it is one-dimensional. So you can't just pass hue=df[['Legendary', 'Stage']]; you have to somehow combine the columns into one. I chose tuple as the simplest and most versatile way, but if you want more control over the formatting, build a Series of strings. I'll save it into a separate variable here for better readability and so that I can assign it a name (which will be used as the legend title), but you don't have to:
hue = df[['Legendary', 'Stage']].apply(
    lambda row: f"{row.Legendary}, {row.Stage}", axis=1)
hue.name = 'Legendary, Stage'
sns.relplot(x='#', y='Attack', hue=hue, data=df)
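The same idea extends to an arbitrary number of grouping columns. Below is a minimal sketch that uses only pandas to build the combined key. The helper name combined_hue, its sep parameter and the toy DataFrame are hypothetical and not part of the original answer:

```python
import pandas as pd

def combined_hue(df, columns, sep=", "):
    """Join any number of columns into one string Series usable as hue."""
    # Cast to str first so booleans and integers can be joined like text
    hue = df[columns].astype(str).agg(sep.join, axis=1)
    # seaborn uses the Series name as the legend title
    hue.name = sep.join(columns)
    return hue

# Toy data standing in for the Pokemon table (hypothetical values)
toy = pd.DataFrame({
    '#': [1, 2, 3],
    'Attack': [49, 62, 110],
    'Legendary': [False, False, True],
    'Stage': [1, 2, 1],
})

hue = combined_hue(toy, ['Legendary', 'Stage'])
print(hue.tolist())  # ['False, 1', 'False, 2', 'True, 1']
```

The resulting Series has one entry per row and is one-dimensional, so it can be handed to seaborn in the same way as above, for instance sns.relplot(x='#', y='Attack', hue=hue, data=toy).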
To use the hue parameter of seaborn.relplot, consider concatenating the needed groups into a single column and then running the plot on the new variable:
def run_plot(df, flds):
    # CREATE NEW COLUMN OF CONCATENATED VALUES
    # (assumes df has a default RangeIndex so the new Series aligns row by row)
    df['_'.join(flds)] = pd.Series(df.reindex(flds, axis='columns')
                                     .astype('str')
                                     .values.tolist()
                                   ).str.join('_')
    # PLOT WITH hue (uses the df argument rather than the global random_df)
    sns.relplot(x='#', y='Attack', hue='_'.join(flds), data=df, aspect=1.5)
    plt.show()
    plt.clf()
    plt.close()
To demonstrate with random data
Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
### DATA
np.random.seed(22320)
random_df = pd.DataFrame({'#': np.arange(1,501),
'Name': np.random.choice(['Bulbasaur', 'Ivysaur', 'Venusaur',
'Charmander', 'Charmeleon'], 500),
'HP': np.random.randint(1, 100, 500),
'Attack': np.random.randint(1, 100, 500),
'Defense': np.random.randint(1, 100, 500),
'Sp. Atk': np.random.randint(1, 100, 500),
'Sp. Def': np.random.randint(1, 100, 500),
'Speed': np.random.randint(1, 100, 500),
'Stage': np.random.randint(1, 3, 500),
'Legend': np.random.choice([True, False], 500)
})
Plots
run_plot(random_df, ['Legend', 'Stage'])
run_plot(random_df, ['Legend', 'Stage', 'Name'])
In seaborn's scatterplot(), you can combine a hue= and a style= parameter to produce a different color and a different marker for each combination.
Example (taken verbatim from the documentation):
tips = sns.load_dataset("tips")
ax = sns.scatterplot(x="total_bill", y="tip", data=tips)
ax = sns.scatterplot(x="total_bill", y="tip",
hue="day", style="time", data=tips)
I am creating a dask dataframe from a pandas dataframe using the from_pandas() function. When I try to select two columns from the dask dataframe using the square brackets [[ ]], I am getting a KeyError.
According to dask documentation, the dask dataframe supports the square bracket column selection like the pandas dataframe.
# data is a pandas dataframe
dask_df = ddf.from_pandas(data, 30)
data = data[dask_df[['length', 'country']].apply(
lambda x: myfunc(x, countries),
meta=('Boolean'),
axis=1
).compute()].reset_index(drop=True)
This is the error I am getting:
KeyError: "None of [Index(['length', 'country'], dtype='object')] are in the [columns]"
I was thinking that this might be something to do with providing the correct meta for the apply, but from the error it seems like the dask dataframe is not able to select the two columns, which should happen before the apply.
This works perfectly if I replace "dask_df" with "data" (the pandas df) in the apply line.
Is the index not being preserved when I am doing the from_pandas?
Try loading less data at once.
I had the same issue, but when I loaded only a subset of my data, it worked.
With the large dataset, I was able to run print(dask_df.columns) and see e.g.
Index(['apple', 'orange', 'pear'], dtype='object', name='fruit').
But when I ran dask_df.compute I would get KeyError: "None of [Index(['apple', 'orange', 'pear'], dtype='object')] are in the [columns]".
I knew that the data set was too big for my memory, and was trying dask hoping it would just figure it out for me =) I guess I have more work to do, but in any case I am glad to be in dask!
As the error states, the columns ['length', 'country'] do not exist in dask_df.
Create them first, then run your function.
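One way to confirm which names are actually missing before calling apply is to compare the requested names against .columns. This is only a diagnostic sketch: the helper missing_columns and the toy frame (with a deliberately misspelled column) are hypothetical, and a pandas DataFrame is used as a stand-in because dask DataFrames expose .columns in the same way:

```python
import pandas as pd

def missing_columns(frame, wanted):
    """Return the requested column names that are absent from frame.columns."""
    available = set(frame.columns)
    return [name for name in wanted if name not in available]

# Toy stand-in for the real data; note the stray trailing space in 'country '
data = pd.DataFrame({'length': [1, 2], 'country ': ['FR', 'DE']})

missing = missing_columns(data, ['length', 'country'])
print(missing)  # ['country']
```

An empty list means the selection itself is fine and the KeyError originates elsewhere.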
I have a pandas dataframe with two columns of time series data. In my actual data, these columns are large enough that the render is unwieldy without datashader. I am attempting to compare events from these two timeseries. However, I need to be able to tell which data point is from which column. A simple functional example is below. How would I get columns A and B to use different color maps?
import numpy as np
import hvplot.pandas
import pandas as pd
A = np.random.randint(10, size=10000)
B = np.random.randint(30, size=10000)
d = {'A':A,'B':B}
df = pd.DataFrame(d)
df.hvplot(kind='scatter',datashade=True, height=500, width=1000, dynspread=False)
You will have to use the count_cat aggregator that counts each category separately, e.g. in the example above that would look like this:
import datashader as ds
df.hvplot(kind='scatter', aggregator=ds.count_cat('Variable'), datashade=True,
height=500, width=1000)
The 'Variable' here corresponds to the default group_label that hvplot assigns to the columns. If you provided a different group_label, you would have to update the aggregator to match. However, instead of supplying an aggregator explicitly, you can also use the by keyword:
df.hvplot(kind='scatter', by='Variable', datashade=True,
height=500, width=1000)
Once hvplot 0.3.1 is released you'll also be able to supply an explicit cmap, e.g.:
df.hvplot(kind='scatter', by='Variable', datashade=True,
height=500, width=1000, cmap={'A': 'red', 'B': 'blue'})
I have been trying to plot some time series graphs using the pandas dataframe plot function. I was trying to add markers at some arbitrary points on the plot to show anomalous points. The code I used:
df1 = pd.DataFrame({'Entropy Values' : MeanValues}, index=DateRange)
df1.plot(linestyle = '-')
I have a list of dates on which I need to add markers, such as:
Dates = ['15:45:00', '15:50:00', '15:55:00', '16:00:00']
I had a look at this link matplotlib: Set markers for individual points on a line. Does DF.plot have a similar functionality?
I really appreciate the help. Thanks!
DataFrame.plot passes all keyword arguments it does not recognize to the matplotlib plotting method. To put markers at a few points in the plot you can use the markevery argument. Here is an example:
import pandas as pd
df = pd.DataFrame({'A': range(10), 'B': range(10)}).set_index('A')
df.plot(linestyle='-', markevery=[1, 5, 7, 8], marker='o', markerfacecolor='r')
In your case, you would have to do something like
df1.plot(linestyle='-', markevery=Dates, marker='o', markerfacecolor='r')
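Note that matplotlib's markevery expects integer positions (or a slice or boolean mask) rather than labels, so the date strings most likely have to be converted to positions first. Here is a minimal sketch, assuming df1 has a DatetimeIndex; date_range and the values below are hypothetical stand-ins for DateRange and MeanValues from the question:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import pandas as pd

# Hypothetical stand-ins for DateRange and MeanValues
date_range = pd.date_range("2020-01-01 15:30", periods=10, freq="5min")
df1 = pd.DataFrame({'Entropy Values': list(range(10))}, index=date_range)

dates = ['15:45:00', '15:50:00', '15:55:00', '16:00:00']

# Translate the time-of-day strings into integer row positions
time_strings = df1.index.strftime('%H:%M:%S')
positions = [i for i, t in enumerate(time_strings) if t in dates]

# Markers are drawn only at the selected positions
ax = df1.plot(linestyle='-', marker='o', markerfacecolor='r',
              markevery=positions)
print(positions)  # [3, 4, 5, 6]
```

If the index spans several days, matching on the time of day alone would mark every day's occurrence, so full timestamps would be the safer key in that case.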