Multiple Line plot from a dataframe - pandas

From a dataframe, I need to plot the count of events by month to see which events are more likely to happen in a given month. I don't know how to use the column Count.
df["MONTH"]= pd.to_datetime(df["BEGIN_DATE_TIME"], format = "%m/%d/%Y").dt.month
montly_events =df.groupby(["EVENT_TYPE", "MONTH"]).size().astype(int)
montly_events2 = montly_events.to_frame(name = "Count").reset_index()
plt.figure(figsize =(15,3))
sns.lineplot(x="MONTH", y="EVENT_TYPE", palette="ch:.25", data=df)

IIUC, Count is another column in your aggregated dataframe (the one you included as a screenshot). Try using it as y, with one line per event type:
sns.lineplot(x="MONTH", y="Count", hue="EVENT_TYPE", palette="ch:.25", data=montly_events2)

This was the solution I found using a barplot:
g = sns.FacetGrid(montly_events2, row="EVENT_TYPE", hue="MONTH", palette="Set3", height=4, aspect=2)
g.map(sns.barplot, 'MONTH', 'Count', order=hours)
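For a line plot with one line per event type, one possible approach (a sketch, not part of the original answers) is to reshape the counts so that each event type becomes its own column. The sample rows and the names monthly_long and monthly_wide below are hypothetical; only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "BEGIN_DATE_TIME": ["01/05/2020", "01/20/2020", "02/03/2020",
                        "02/11/2020", "02/15/2020", "02/28/2020"],
    "EVENT_TYPE": ["Hail", "Flood", "Tornado", "Hail", "Hail", "Flood"],
})
df["MONTH"] = pd.to_datetime(df["BEGIN_DATE_TIME"], format="%m/%d/%Y").dt.month

# Long format: one row per (event type, month) pair with its count.
monthly_long = df.groupby(["EVENT_TYPE", "MONTH"]).size().reset_index(name="Count")

# Wide format: months as the index, one column per event type.
# Combinations that never occur become 0 instead of NaN.
monthly_wide = (
    monthly_long
    .pivot(index="MONTH", columns="EVENT_TYPE", values="Count")
    .fillna(0)
    .astype(int)
)
print(monthly_wide)
```

Calling monthly_wide.plot() on the result should then draw one line per event type, since DataFrame.plot uses one line per column.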

Related

how to access a specific result of a groupby grouper method grouped with frequency on quarters

df = pd.DataFrame(np.random.choice(pd.date_range('2019-10-01', '2022-10-31'), 15),
columns=['Date'])
df['NUM'] = np.random.randint(1, 600, df.shape[0])
df.groupby(pd.Grouper(key='Date', axis=0, freq='Q-DEC')).sum()
df_test = df.groupby(pd.Grouper(key='Date', axis=0, freq='Q-DEC')).sum()
df_test.iloc[-1]
So I can assign the groupby result to a variable, which creates another dataframe, and then access any entry of it (in this instance, the last one). My question is: can I avoid creating df_test altogether and still access the last entry (or any entry I care to access)?
You can add .tail(1) at the end of the groupby statement like this:
df.groupby(pd.Grouper(key='Date', axis=0, freq='Q-DEC')).sum().tail(1)
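The same chaining works with .iloc or .loc, because the aggregation already returns an ordinary pandas object. The sketch below uses small hypothetical data and passes pd.offsets.QuarterEnd(startingMonth=12) instead of the 'Q-DEC' string, on the assumption that the offset object is the more portable spelling across pandas versions.

```python
import pandas as pd

# Hypothetical data with known quarterly totals.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-15", "2022-02-20", "2022-05-03", "2022-10-30"]),
    "NUM": [10, 20, 30, 40],
})

# Quarter-end offset anchored in December, i.e. the same bins as 'Q-DEC'.
quarter_end = pd.offsets.QuarterEnd(startingMonth=12)

# Last quarter, by position, without an intermediate variable.
last_total = df.groupby(pd.Grouper(key="Date", freq=quarter_end))["NUM"].sum().iloc[-1]

# A specific quarter, by its quarter-end label.
q1_total = (
    df.groupby(pd.Grouper(key="Date", freq=quarter_end))["NUM"]
    .sum()
    .loc[pd.Timestamp("2022-03-31")]
)
print(last_total, q1_total)
```

Unlike .tail(1), which returns a one-row object, .iloc[-1] on the selected column returns the value itself.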

pandas df columns series

I have a dataframe and have done some operations on its columns as follows:
df1=sample_data.sort_values("Population")
df2=df1[(df1.Population > 500000) & (df1.Population < 1000000)]
df3=df2["Avg check"]*df2["Avg Daily Rides Last Week"]/df2["CAC"]
df4=df2["Avg check"]*df2["Avg Daily Rides Last Week"]
([[df3],[df4]])
If I understand correctly, df3 and df4 are now Series, not dataframes. There should be a way to make a new dataframe from these Series and draw a scatter plot. Please advise. Thanks.
I wanted to annotate each point and ran into an issue:
df3=df2["Avg check"]*df2["Avg Daily Rides Last Week"]/df2["CAC"]
df4=df2["Avg check"]*df2["Avg Daily Rides Last Week"]
df5=df2["Population"]
df6=df2["city_id"]
sct=plt.scatter(df5,df4,c=df3, cmap="viridis")
plt.xlabel("Population")
plt.ylabel("Avg check x Avg Daily Rides")
for i, txt in enumerate(df6):
    plt.annotate(txt,(df4[i],df5[i]))
plt.colorbar()
plt.show()
I think you can pass both Series to matplotlib.pyplot.scatter:
import matplotlib.pyplot as plt
sc = plt.scatter(df3, df4)
EDIT: Swap df5 and df4, and select by position with Series.iat:
for i, txt in enumerate(df6):
    plt.annotate(txt, (df5.iat[i], df4.iat[i]))
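The reason .iat helps is that a filtered dataframe keeps its original row labels, so an integer inside [] is looked up as a label rather than a position. A minimal sketch with hypothetical values and labels:

```python
import pandas as pd

# Hypothetical filtered Series: the index keeps the original row labels.
population = pd.Series([600_000, 750_000, 900_000], index=[7, 2, 11])
city_id = pd.Series(["A", "B", "C"], index=[7, 2, 11])

# population[0] would look for the label 0, which does not exist here.
label_zero_exists = 0 in population.index

# .iat selects by position, so it works regardless of the labels.
points = [(city_id.iat[i], population.iat[i]) for i in range(len(population))]
print(label_zero_exists, points)
```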
You can create a DataFrame from Series. Simply put both Series in a dictionary:
author = ['Jitender', 'Purnima', 'Arpit', 'Jyoti']
article = [210, 211, 114, 178]
auth_series = pd.Series(author)
article_series = pd.Series(article)
frame = { 'Author': auth_series, 'Article': article_series }
and then create a DataFrame from that dictionary:
result = pd.DataFrame(frame)
The code is from geeksforgeeks.org
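Applied to the columns in the question, the dictionary approach could look like the sketch below. The values, the row labels, and the names revenue, revenue_per_cac and combined are hypothetical; only the source column names come from the question.

```python
import pandas as pd

# Hypothetical stand-in for df2; only the column names come from the question.
df2 = pd.DataFrame(
    {
        "Avg check": [10.0, 12.0, 8.0],
        "Avg Daily Rides Last Week": [100, 200, 150],
        "CAC": [5.0, 4.0, 6.0],
    },
    index=[7, 2, 11],
)

revenue = df2["Avg check"] * df2["Avg Daily Rides Last Week"]  # df4 in the question
revenue_per_cac = revenue / df2["CAC"]                         # df3 in the question

# Both Series share df2's index, so the rows line up automatically.
combined = pd.DataFrame({"revenue": revenue, "revenue_per_cac": revenue_per_cac})
print(combined)
```

The resulting frame can be passed to combined.plot.scatter(x="revenue_per_cac", y="revenue") if a dataframe-based plot is preferred over calling plt.scatter on the two Series.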

Setting x-axis as Month from datetime arange of dataframe column using matplotlib plotting

How can I have the dates on the x-axis displayed as months, e.g. 'Jan', 'Feb', ..., 'Dec', instead of 2015-01, 2015-02, etc.?
I'm reproducing a shortened version of my df.
I have read https://matplotlib.org/examples/api/date_demo.html and am not sure how to apply it to my data.
Or maybe, when I'm mapping the dates with pd.to_datetime, I could somehow convert them to months only, using '%b' for the month names?
Or could I use something along these lines, pd.date_range('2015-01-01','2016-01-01', freq='MS').strftime("%b").tolist(), in xticks?
Thank you!
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'Date':['2015-01-01','2015-01-02','2015-01-03','2015-01-04','2015-01-05'],'Values':[-13.3,-12.2,6.7,8.8,15.5]})
df1.Date=pd.to_datetime(df1.Date)
df1['Day']=pd.DatetimeIndex(df1['Date']).day
df1['Month']=pd.DatetimeIndex(df1['Date']).month
df1=df1[['Month','Day','Values']]
df1
df2=pd.DataFrame({'Date':['2015-01-01','2015-01-02','2015-01-03','2015-01-04','2015-01-05'],'Values':[-5.6,-5.6,0,3.9,9.4]})
df2['Day']=pd.DatetimeIndex(df2['Date']).day
df2['Month']=pd.DatetimeIndex(df2['Date']).month
df2=df2[['Month','Day','Values']]
df2
plt.figure(figsize=(10,6))
ax = plt.gca()
dates = np.arange('2015-01-01', '2015-01-06', dtype='datetime64[D]')
dates=list(map(pd.to_datetime, dates))
plt.plot(dates, df1.Values, '-^',dates, df2.Values, '-o')
ax.set_xlim(dates[0],dates[4])
x = plt.gca().xaxis
for item in x.get_ticklabels():
    item.set_rotation(45)
plt.show()
I have arrived at the following solution:
import matplotlib.dates as mdates
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
Perhaps someone has a better one.
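A formatter only changes how each tick is written, not where the ticks sit, so with data spanning a few days every tick would read 'Jan'. A possible refinement, sketched here on hypothetical data covering a full year, is to pair the formatter with mdates.MonthLocator so that there is one tick per month:

```python
import datetime

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: one value on the first day of each month of 2015.
dates = pd.date_range("2015-01-01", periods=12, freq="MS")
values = list(range(12))

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(dates, values, "-o")

# One major tick at the start of every month, labelled with the short month name.
ax.xaxis.set_major_locator(mdates.MonthLocator())
formatter = mdates.DateFormatter("%b")
ax.xaxis.set_major_formatter(formatter)

# The formatter can also be called directly on a matplotlib date number.
label = formatter(mdates.date2num(datetime.datetime(2015, 3, 1)))
print(label)
```

Call plt.show() afterwards to display the figure interactively.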

How to resample a dataframe with different functions applied to each column if we have more than 20 columns?

I know this question has been asked before. The answer is as follows:
df.resample('M').agg({'col1': np.sum, 'col2': np.mean})
But I have 27 columns, and I want to sum the first 25 and average the remaining two. Do I have to write 'col1' through 'col25': np.sum for the 25 columns and 'col26': np.mean, 'col27': np.mean for the other two?
My dataframe contains hourly data and I want to convert it to monthly data. I tried something like the following, but it is nonsense:
for i in col_list:
    df = df.resample('M').agg({i-2: np.sum, 'col26': np.mean, 'col27': np.mean})
Is there any shortcut for this situation?
You can try this, without a for loop:
sum_col = ['col1','col2','col3','col4', ...]
sum_df = df.resample('M')[sum_col].sum()
mean_col = ['col26','col27']
mean_df = df.resample('M')[mean_col].mean()
df = sum_df.join(mean_df)
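Another option is to build the agg dictionary programmatically instead of typing 27 entries. The sketch below assumes the columns are ordered so that the first 25 are summed and the last two averaged; the data is hypothetical, and pd.offsets.MonthEnd() is used in place of the 'M' string so the example does not depend on which alias spelling a given pandas version accepts.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly data for January and February 2023: 27 columns of ones.
index = pd.date_range("2023-01-01", periods=59 * 24, freq=pd.Timedelta(hours=1))
columns = [f"col{i}" for i in range(1, 28)]
df = pd.DataFrame(np.ones((len(index), len(columns))), index=index, columns=columns)

# Assumption: the first 25 columns are summed, the last two averaged.
sum_cols = columns[:25]
mean_cols = columns[25:]
agg_map = {**{c: "sum" for c in sum_cols}, **{c: "mean" for c in mean_cols}}

# One resample call, with a different aggregation per column.
monthly = df.resample(pd.offsets.MonthEnd()).agg(agg_map)
print(monthly[["col1", "col27"]])
```

This keeps everything in a single resample pass, and the result has the same column order as the dictionary.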

concat series onto dataframe with column name

I want to add a Series (s) to a Pandas DataFrame (df) as a new column. The series has more values than there are rows in the dataframe, so I am using the concat method along axis 1.
df = pd.concat((df, s), axis=1)
This works, but the new column of the dataframe representing the series is given an arbitrary numerical column name, and I would like this column to have a specific name instead.
Is there a way to add a series to a dataframe, when the series is longer than the rows of the dataframe, and with a specified column name in the resulting dataframe?
You can try Series.rename:
df = pd.concat((df, s.rename('col')), axis=1)
One option is simply to specify the name when creating the series:
example_scores = pd.Series([1,2,3,4], index=['t1', 't2', 't3', 't4'], name='example_scores')
Using the name attribute when creating the series is all I needed.
Try:
df = pd.concat((df, s.rename('CoolColumnName')), axis=1)
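A small sketch of what the renamed concat produces when the Series is longer than the dataframe (the values and the column name 'col' are hypothetical). Rows are aligned on the index, so the extra row gets NaN in the existing columns:

```python
import pandas as pd

# Hypothetical inputs: a two-row dataframe and a three-element Series.
df = pd.DataFrame({"a": [1, 2]})
s = pd.Series([10, 20, 30])

# rename() sets the Series name, which becomes the column name after concat.
out = pd.concat((df, s.rename("col")), axis=1)
print(out)
```

Because of the alignment, column 'a' becomes float here: the missing value in the third row cannot be stored in an integer column.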