Formatting Matplotlib plots [duplicate] - pandas

I've got a dataset that looks a bit like this.
df
headline some_url time is_national
0 Holloway url 2023-01-11 11:44:27 True
1 London url 2023-01-11 11:25:10 False
2 Viral url 2023-01-11 10:43:39 False
3 London url 2023-01-11 09:41:18 True
4 Royal url 2023-01-11 15:49:38 False
I've been able to create a categorical column for day of the week thus:
cats = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df['day_of_week'] = df.time.dt.day_name()
df['day_of_week'] = pd.Categorical(df['day_of_week'], categories=cats, ordered=True)
I planned to create an hour column like this:
df['hour'] = df.time.dt.hour
But the hour column comes out as a floating point.
The result when plotted is:
How do I avoid the floating point?
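One likely cause, assuming some rows in the time column are missing (NaT): pandas then returns the hour as float64 so that it can hold NaN. The sketch below uses a small made-up frame (not the real dataset) and shows two possible ways around it: dropping the missing rows, or casting to the nullable Int64 dtype.

```python
import pandas as pd

# Hypothetical sample: two real timestamps and one missing value (NaT)
df = pd.DataFrame(
    {"time": pd.to_datetime(["2023-01-11 11:44:27", "2023-01-11 09:41:18", None])}
)

# With a NaT present, the hour comes back as float64 (11.0, 9.0, NaN)
hour_float = df["time"].dt.hour

# Option 1: drop rows without a timestamp, then the hour is a plain integer
hour_int = df.dropna(subset=["time"])["time"].dt.hour

# Option 2: keep every row and use the nullable integer dtype instead
hour_nullable = df["time"].dt.hour.astype("Int64")

print(hour_float.dtype, hour_int.dtype, hour_nullable.dtype)
```

If the real column has no missing values, the cause is something else and this will not change anything.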
My second query is two-fold. I can produce a histogram of each column using the .plot(kind='hist') function in pandas.
But I also tried a KDE plot with this code:
ax = df.hour.plot(kind='kde', title="Articles by hour")
ax.set_xlabel("Hour")
ax.set_ylabel("Number of articles")
Which looks like this:
Is there a simple way of cropping the plot to avoid minus hours or hours beyond 24?
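A minimal sketch of one way to crop the plot, assuming the hours are integers from 0 to 23 (the data here is randomly generated, not the real articles): set the x-axis limits on the returned Axes with set_xlim. Note that pandas needs SciPy installed for kind='kde', and that the KDE curve is a density estimate, so the y-axis is a density rather than a count.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import numpy as np
import pandas as pd

# Hypothetical data: 200 random integer hours between 0 and 23
rng = np.random.default_rng(0)
hours = pd.Series(rng.integers(0, 24, size=200), name="hour")

ax = hours.plot(kind="kde", title="Articles by hour")
ax.set_xlim(0, 24)              # hide the tails below 0 and above 24
ax.set_xticks(range(0, 25, 4))  # integer tick positions only
ax.set_xlabel("Hour")
ax.set_ylabel("Density")
```

This only changes what is displayed. The estimated curve still extends beyond the limits, it is just no longer drawn.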

Related

Does a sorted dataframe keep its order after groupby? [duplicate]

I would like to keep the latest entry per group in a dataframe:
from datetime import date
import pandas as pd
data = [
['A', date(2018,2,1), "I want this"],
['A', date(2018,1,1), "Don't want"],
['B', date(2019,4,1), "Don't want"],
['B', date(2019,5,1), "I want this"]]
df = pd.DataFrame(data, columns=['name', 'date', 'result'])
The following does what I want (found and credits here):
df.sort_values('date').groupby('name').tail(1)
name date result
0 A 2018-02-01 I want this
3 B 2019-05-01 I want this
But how do I know the order is always preserved when you do a groupby on a sorted data frame like df? Is it somewhere documented?
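Not entirely. groupby preserves the order of rows within each group, but by default it sorts the group keys, so the order of the groups themselves is not kept. Try replacing A with Z and aggregating (for example with .last()) to see it.
To keep the groups in order of first appearance as well, use sort=False: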
df.sort_values('date').groupby('name', sort=False).tail(1)
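As a quick check, here is the question's own data run through the sort=False version, followed by a possible alternative that does not depend on row order at all. The idxmax variant is an addition for comparison, not something proposed in the thread.

```python
from datetime import date
import pandas as pd

data = [
    ["A", date(2018, 2, 1), "I want this"],
    ["A", date(2018, 1, 1), "Don't want"],
    ["B", date(2019, 4, 1), "Don't want"],
    ["B", date(2019, 5, 1), "I want this"],
]
df = pd.DataFrame(data, columns=["name", "date", "result"])

# Sort by date, then take the last row of each group
latest = df.sort_values("date").groupby("name", sort=False).tail(1)

# Alternative: look up the row label of the maximum date per group
dates = pd.to_datetime(df["date"])
latest_by_idxmax = df.loc[dates.groupby(df["name"]).idxmax()]

print(latest)
```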

Increasing the number of x and y tick values in matplotlib [duplicate]

I have made a plot in Jupyter and got the output shown in the picture. The x axis currently shows 1995, 2000, 2005, 2010, 2015, and I want more x values, say 1995, 1997, 1999, 2001, 2003 and so on.
I entered the code below, but I am unable to produce more x values and y values as described above.
fig=plt.figure(figsize=(9, 7), dpi= 100, facecolor='w', edgecolor='k')
plt.plot(df_3)
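You can pass an array of tick positions to xticks, for example one built with np.arange (with NumPy imported as np):
plt.xticks(np.arange(min_x, max_x+1, 1.0))
The last argument is the step between ticks. You can set it to 2 or 3 instead to space the ticks further apart, for example 2 to get 1995, 1997, 1999 and so on.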
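A self-contained sketch of the same idea, assuming df_3 is indexed by year. The frame below is a made-up stand-in, since the real df_3 is not shown in the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_3: one value per year from 1995 to 2015
years = np.arange(1995, 2016)
df_3 = pd.DataFrame({"value": np.linspace(0.0, 1.0, len(years))}, index=years)

fig, ax = plt.subplots(figsize=(9, 7), dpi=100)
ax.plot(df_3.index, df_3["value"])

# One tick every 2 years, from the first to the last year in the index
ticks = np.arange(df_3.index.min(), df_3.index.max() + 1, 2)
ax.set_xticks(ticks)
```

The same pattern with ax.set_yticks works for the y axis.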

Pandas df histo, format my x ticker and include empty

I got this pandas df:
index TIME
12:07 2019-06-03 12:07:28
10:04 2019-06-04 10:04:25
11:14 2019-06-09 11:14:25
...
I use this command to draw a histogram of the number of occurrences in each 15-minute period:
df['TIME'].groupby([df["TIME"].dt.hour, df["TIME"].dt.minute]).count().plot(kind="bar")
My plot looks like this:
How can I get x ticks like 10:15 instead of (10, 15), and how can I add the missing x ticks such as 9:15, 9:30... to get a complete timeline?
You can resample your TIME column to 15-minute intervals and count the number of rows. Then plot a regular bar chart.
import numpy as np
import pandas as pd
import matplotlib.ticker
# 100 random timestamps within a 3-hour window
df = pd.DataFrame({'TIME': pd.to_datetime('2019-01-01') + pd.to_timedelta(np.random.rand(100) * 3, unit='h')})
df = df[df.TIME.dt.minute > 15] # make gap
ax = df.resample('15min', on='TIME').count().plot.bar(rot=0)
# keep only HH:MM from each 'YYYY-MM-DD HH:MM:SS' tick label
ticklabels = [x.get_text()[-8:-3] for x in ax.get_xticklabels()]
ax.xaxis.set_major_formatter(matplotlib.ticker.FixedFormatter(ticklabels))
(for details about formatting datetime ticklabels of pandas bar plots see this SO question)
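When the timestamps fall on different dates, as in the question's sample, resampling would produce one bin per 15 minutes across all the days in between. A possible variant, sketched here as an assumption rather than as part of the answer above, is to floor each timestamp to its 15-minute slot, keep only the time of day as an "HH:MM" string, and reindex over every slot so that empty ones show up as zero. The date used to build the slot labels is an arbitrary placeholder.

```python
import pandas as pd

# The three sample rows from the question
df = pd.DataFrame({"TIME": pd.to_datetime([
    "2019-06-03 12:07:28",
    "2019-06-04 10:04:25",
    "2019-06-09 11:14:25",
])})

# Floor to the 15-minute slot and keep only the time of day as "HH:MM"
slots = df["TIME"].dt.floor("15min").dt.strftime("%H:%M")
counts = slots.value_counts()

# Every 15-minute label between the first and last observed slot
# (the date is a placeholder, only the times are used)
all_slots = pd.date_range(
    "2019-01-01 " + slots.min(), "2019-01-01 " + slots.max(), freq="15min"
).strftime("%H:%M")

# Empty slots get a count of 0 so the timeline has no gaps
counts = counts.reindex(all_slots, fill_value=0)
print(counts)
```

Calling counts.plot.bar(rot=0) on the result should then give one bar per slot with "HH:MM" tick labels.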

Apply diffs down columns of pandas dataframe [duplicate]

I want to apply diffs down columns for a pandas dataframe.
EX:
A B C
23 40000 1
24 nan nan
nan 42000 2
I would want something like:
A B C
23 40000 1
24 40000 1
24 42000 2
I have tried variations of pandas groupby. I think this is probably the right approach (or applying some function down the columns, but I'm not sure if that is efficient; correct me if I'm wrong).
I was able to "apply diffs down the column" and get something like:
A B C
24 42000 2
by calling: df = df.groupby('col', as_index=False).last() for each column, but this is not what I am looking for. I am not a pandas expert so apologies if this is a silly question.
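What you describe is a forward fill: each NaN is replaced by the last valid value above it in the same column. See the fillna documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
In recent pandas versions fillna(method='ffill') is deprecated, so call ffill directly:
df = df.ffill()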
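Applied to the example from the question, a forward fill looks roughly like this (the frame is rebuilt by hand from the table above):

```python
import numpy as np
import pandas as pd

# The example table from the question, with NaN for the missing cells
df = pd.DataFrame({
    "A": [23, 24, np.nan],
    "B": [40000, np.nan, 42000],
    "C": [1, np.nan, 2],
})

# Replace each NaN with the last valid value above it in the same column
filled = df.ffill()
print(filled)
```

The columns stay float because they contained NaN before filling. Adding .astype(int) afterwards is an option if every gap has been filled.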

How do I plot 2 columns of a pandas dataframe excluding rows selected by a third column [duplicate]

I have a pandas dataframe that looks like this:
Nstrike Nprem TDays
0 0.920923 0.000123 2
1 0.951621 0.000246 2
2 0.957760 0.001105 2
..............................
16 0.583251 0.000491 7
17 0.613949 0.000614 7
18 0.675344 0.000368 7
..............................
100 1.013016 0.029592 27
101 1.043713 0.049730 27
102 1.074411 0.071218 27
etc.
I would like to plot a graph of col.1 vs col.2, in separate plots as selected by col.3, maybe even in different colors.
The only way I can see to do that is to separate the dataframe into discrete dataframes for each col.3 value.
Or I could give up on pandas and just make the col.3 subsets into plain python arrays.
I am free to change the structure of the dataframe if it would simplify the problem.
IIUC, you can use this as a skeleton, and customize it how you want:
for g, data in df.groupby('TDays'):
    plt.plot(data.Nstrike, data.Nprem, label='TDays '+str(g))
    plt.legend()
    plt.savefig('plot_'+str(g)+'.png')
    plt.close()
Your first plot will look like:
Your second:
And so on
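If a single figure with one colour per TDays value is preferred over separate files, a variation on the skeleton above could look like this. It is a sketch that uses only the first few rows shown in the question, and matplotlib's default colour cycle assigns the colours.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# A few rows copied from the question's frame
df = pd.DataFrame({
    "Nstrike": [0.920923, 0.951621, 0.957760, 0.583251, 0.613949, 0.675344],
    "Nprem": [0.000123, 0.000246, 0.001105, 0.000491, 0.000614, 0.000368],
    "TDays": [2, 2, 2, 7, 7, 7],
})

fig, ax = plt.subplots()
# One line per TDays value, all drawn on the same axes
for tdays, group in df.groupby("TDays"):
    ax.plot(group["Nstrike"], group["Nprem"], marker="o", label=f"TDays {tdays}")

ax.set_xlabel("Nstrike")
ax.set_ylabel("Nprem")
ax.legend()
```

No splitting into separate dataframes is needed, since groupby hands each subset to the loop in turn.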