Applying functions to DataFrame columns in plots - pandas

I'd like to apply functions to columns of a DataFrame when plotting them.
I understand that the standard way to plot when using Pandas is the .plot method.
How can I do math operations within this method, say for example multiply two columns in the plot?
Thanks!

Series actually have a plot method as well, so it should work to apply
(df['col1'] * df['col2']).plot()
Otherwise, if you need to do this more than once it would be the usual thing to make a new column in your dataframe:
df['newcol'] = df['col1'] * df['col2']
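Both options can be put together as a minimal sketch. The data here is made up, the Agg backend is chosen only so the snippet runs without a display, and the names col1, col2 and newcol simply follow the answer above:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import pandas as pd

# Hypothetical data; col1 and col2 stand in for your own columns.
df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [10, 20, 30, 40]})

# Option 1: compute the product on the fly and plot the resulting Series.
ax = (df["col1"] * df["col2"]).plot(label="col1 * col2")

# Option 2: store the product as a new column, then plot it by name
# on the same axes (dashed, so both lines stay visible).
df["newcol"] = df["col1"] * df["col2"]
df.plot(y="newcol", ax=ax, linestyle="--")
```

Any arithmetic on columns returns a Series, so the same pattern should work for sums, ratios, or a function applied with .apply().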

Related

Cannot plot a histogram from a Pandas dataframe

I've used pandas.read_csv to generate a 1000-row dataframe with 32 columns. I'm looking to plot a histogram or bar chart (depending on data type) of each column. For columns of type 'int64', I've tried doing matplotlib.pyplot.hist(df['column']) and df.hist(column='column'), as well as calling matplotlib.pyplot.hist on df['column'].values and df['column'].to_numpy(). Weirdly, they all take a really long time (>30s), and when I've allowed them to complete, I get unit-height bars in multiple colors, as if there's some sort of implicit grouping and they're all being separated into different groups. Any ideas about what I can do to get a normal histogram? Unfortunately I closed the charts so I can't show you an example right now.
Edit - this seems to be a much bigger problem with Int columns, and casting them to float fixes the problem.
Follow these two steps:
import the pyplot module from the Matplotlib library
call its hist function, which accepts a dataframe column (a Series) as its argument
import matplotlib.pyplot as plt
plt.hist(df['column'], color='blue', edgecolor='black', bins=int(45/1))
Here's the source.
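As a rough sketch, the float cast from the question's edit can be combined with the hist call from the answer. The data below is randomly generated and only stands in for the asker's CSV, and the Agg backend is an assumption so that the snippet runs without a display. This does not explain why the integer columns misbehaved; it only shows the workaround in one place:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for one int64 column of the 1000-row dataframe.
rng = np.random.default_rng(0)
df = pd.DataFrame({"column": rng.integers(0, 100, size=1000)})

# Cast to float first, as the asker reported this fixed the problem.
values = df["column"].astype(float)

fig, ax = plt.subplots()
# hist returns the bin counts, the bin edges and the drawn patches.
counts, edges, patches = ax.hist(values, bins=45, color="blue", edgecolor="black")
```

Checking that counts sums to the number of rows is a quick way to confirm that every value landed in a single histogram rather than in separate groups.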

Pandas plot with only two dates as index

I have this simple pandas dataframe with dates as index:
df=pd.DataFrame({'a':[20,30,12],'b':[15,18,18]},index=['2021-10-7','2021-10-8','2021-10-9'])
df.index=pd.to_datetime(df.index)
When I try to plot it with only two dates on the x-axis
df.iloc[-2:].plot()
it gives me the following plot with lots of numbers on the axis.
The plot works fine if I plot the entire dataframe: df.plot()
Thank you for your support.
You can add the line below after setting your index to make it work.
df.index.freq = 'D'
So your entire code looks like this:
df=pd.DataFrame({'a':[20,30,12],'b':[15,18,18]},index=['2021-10-7','2021-10-8','2021-10-9'])
df.index = pd.to_datetime(df.index)
df.index.freq = 'D'
Alternatively, you can use date_range as shown below.
Please note this works only if your data, like the example provided, has a daily frequency. You need to adjust it in cases where the frequency is different.
df=pd.DataFrame({'a':[20,30,12],'b':[15,18,18]},index=['2021-10-7','2021-10-8','2021-10-9'])
df.index=pd.date_range(start = '2021-10-07', end='2021-10-09')
Both approaches will give you the same plot as the one you mentioned in the question (similar to the bottom one shown there).

Naming variable for median calculation using Numpy

I'm using numpy to get the median. The dataframe has two variables. Is there a way to tell it which variable I want the median for?
np.median(dataframename)
You must cast your dataframe to a NumPy array. Try this:
#input data in dataframename
dataframename = np.asarray(dataframename)
dataframename = dataframename.astype(float)
np.median(dataframename)
I realized that my data was not in a dataframe. Once I put it in, this worked.
dataframename.loc[:,"var18"].median()
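To make the difference concrete, here is a small sketch with made-up numbers. The column var17 is hypothetical, and var18 follows the name used above. Calling np.median on the whole array pools every value, while selecting a column first gives the median of that variable only:

```python
import numpy as np
import pandas as pd

# Hypothetical two-variable dataframe.
df = pd.DataFrame({"var17": [1.0, 5.0, 3.0], "var18": [10.0, 2.0, 4.0]})

# Median over all six values pooled together (usually not what is wanted).
pooled_median = np.median(df.to_numpy())

# Median of a single variable, using pandas or NumPy.
col_median = df.loc[:, "var18"].median()
np_col_median = np.median(df["var18"].to_numpy())

# Median of every column at once, returned as a Series indexed by column name.
all_medians = df.median()
```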

create dask DataFrame from a list of dask Series

I need to create a dask DataFrame from a set of dask Series,
analogously to constructing a pandas DataFrame from lists
pd.DataFrame({'l1': list1, 'l2': list2})
I am not seeing anything in the API. The dask DataFrame constructor is not supposed to be called by users directly and takes a computation graph as its main argument.
In general I agree that it would be nice for the dd.DataFrame constructor to behave like the pd.DataFrame constructor.
If your series have well defined divisions then you might try dask.dataframe.concat with axis=1.
You could also try converting one of the series into a DataFrame and then use assignment syntax:
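L = [...]  # placeholder: your list of dask Series
df = L[0].to_frame()  # start the DataFrame from the first Series
for s in L[1:]:
    df[s.name] = s  # add each remaining Series as a column named after it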

Why does DataFrameGroupBy.boxplot method throw error when given argument "subplots=True/False"?

I can use DataFrameGroupBy.boxplot(...) to create a boxplot in the following way:
In [15]: df = pd.DataFrame({"gene_length":[100,100,100,200,200,200,300,300,300],
...: "gene_id":[1,1,1,2,2,2,3,3,3],
...: "density":[0.4,1.1,1.2,1.9,2.0,2.5,2.2,3.0,3.3],
...: "cohort":["USA","EUR","FIJ","USA","EUR","FIJ","USA","EUR","FIJ"]})
In [17]: df.groupby("cohort").boxplot(column="density",by="gene_id")
In [18]: plt.show()
This produces the following image:
This is exactly what I want, except instead of making three subplots, I want all the plots to be in one plot (with different colors for USA, EUR, and FIJ). I've tried
In [17]: df.groupby("cohort").boxplot(column="density",subplots=False,by="gene_id")
but it produces the error
KeyError: 'gene_id'
I think the problem has something to do with the fact that by="gene_id" is a keyword sent to the matplotlib boxplot method. If someone has a better way of producing the plot I am after, perhaps by using DataFrame.boxplot(?) instead, please respond here. Thanks so much!
To use the pure pandas functions, I think you should not GroupBy before calling boxplot, but instead, request to group by certain columns in the call to boxplot on the DataFrame itself:
df.boxplot(column='density',by=['gene_id','cohort'])
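As a minimal sketch, that call can be run against the dataframe from the question. The Agg backend is an assumption so that it runs without a display. Note that in this sample data each gene_id/cohort pair holds a single observation, so every box collapses to a line:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "gene_length": [100, 100, 100, 200, 200, 200, 300, 300, 300],
    "gene_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "density": [0.4, 1.1, 1.2, 1.9, 2.0, 2.5, 2.2, 3.0, 3.3],
    "cohort": ["USA", "EUR", "FIJ", "USA", "EUR", "FIJ", "USA", "EUR", "FIJ"],
})

# Group inside boxplot itself rather than calling groupby first:
# one box per (gene_id, cohort) pair, all drawn on a single set of axes.
df.boxplot(column="density", by=["gene_id", "cohort"])

# Number of boxes to expect: one per distinct (gene_id, cohort) pair.
n_boxes = df.groupby(["gene_id", "cohort"]).ngroups
fig = plt.gcf()
```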
To get a better-looking result, you might want to consider using the Seaborn library. It is designed to help with precisely this sort of task:
sns.boxplot(data=df,x='gene_id',y='density',hue='cohort')
EDIT to take into account comment below
If you want to have each of your cohort boxplots stacked/superimposed for each gene_id, it's a bit more complicated (plus you might end up with quite an ugly output). You cannot do this using Seaborn, AFAIK, but you could with pandas directly, by using the positions= parameter to boxplot (see doc). The catch is to generate the correct sequence of positions to place the boxplots where you want them, and you'll have to fix the tick labels and the legend yourself.
pos = [i for i in range(len(df.gene_id.unique())) for _ in range(len(df.cohort.unique()))]
df.boxplot(column='density',by=['gene_id','cohort'],positions=pos)
An alternative would be to use seaborn.swarmplot instead of boxplot. A swarmplot plots every point instead of the summary representation of a boxplot, and you can use the parameter split=False (renamed to dodge in later seaborn releases) to get the points colored by cohort but stacked on top of each other for each gene_id.
sns.swarmplot(data=df,x='gene_id',y='density',hue='cohort', split=False)
Without knowing the actual content of your dataframe (number of points per gene and per cohort, and how separate they are in each cohort), it's hard to say which solution would be the most appropriate.