ggplot time series: drop the date and align the time axis on hours - ggplot2

I have a data frame with three variables: Name, Heart_rate, and Time.
df %>% str
$ Name      : chr "A" "A" "A" "A" ...
$ Heart_rate: num 77 77 77 77 77 77 77 77 77 78 ...
$ Time      : POSIXct, format: "2021-08-30 06:56:41"
and I am trying to plot it using ggplot:
df %>% ggplot(aes(Time, Heart_rate, color = Name)) + geom_line()
The result is here.
The problem is that those heart rates were measured on different days. I want to plot the three people's heart-rate time series based only on hours, minutes, and seconds, dropping the dates (ymd). I tried facet_wrap with the scales = "free_x" option, but the time axes do not line up.
df %>% ggplot(aes(Time, Heart_rate, color = Name)) + geom_line() + facet_wrap(~Name, scales = "free_x", nrow = 3)
Do you have better ways to do this?

I figured it out. My way is to define a new POSIXct time by attaching every clock time to the same arbitrary day.
df %>%
  # keep only the clock time and re-anchor it on an arbitrary common date
  mutate(Time2 = as.Date('2022-01-01') + hms(Time %>% format(format = '%H:%M:%S'))) %>%
  ggplot(aes(Time2, Heart_rate, color = Name)) + geom_line()

Related

Duplicated rows when merging in pandas

I have a list that contains multiple pandas dataframes.
Each dataframe has a 'Trading Day' column and maturity columns.
However, the names of the maturity columns change depending on the maturity; for example, the first dataframe's columns are 'Trading Day', 'Y_2021', 'Y_2022', while the second dataframe has 'Trading Day', 'Y_2022', 'Y_2023', 'Y_2024'.
The 'Trading Day' column has all-unique np.datetime64 dates in every dataframe, and the maturity columns have either floats or NaNs.
My goal is to merge all the dataframes into one and have something like:
'Trading Day', 'Y_2021', 'Y_2022', 'Y_2023', ..., 'Y_2030'
In my code, gh is the list that contains all the dataframes, and original is a dataframe that contains all the dates from 5 years ago through today.
gt is the final dataframe.
So far what I have done is:
So far what I have done is:
import numpy as np
import pandas as pd
from datetime import date

# gh (the list of dataframes) and year_now (the current year) are defined earlier
original = pd.DataFrame()
original['Trading Day'] = np.arange(np.datetime64(str(year_now - 5) + '-01-01'),
                                    np.datetime64(date.today()) + 1)
for i in range(len(gh)):
    gh[i]['Trading Day'] = gh[i]['Trading Day'].astype('datetime64[ns]')
gt = pd.merge(original, gh[0], on='Trading Day', how='left')
for i in range(1, len(gh)):
    gt = pd.merge(gt, gh[i], how='outer')
The code more or less works; the problem is that when the year columns change between dataframes, I get results like this:
Trading Day  Y_2021  Y_2023  Y_2024
2020-06-05       45     NaN     NaN
2020-06-05      NaN      54     NaN
2020-06-05      NaN     NaN      43
2020-06-06       34     NaN     NaN
2020-06-06      NaN      23     NaN
2020-06-06      NaN     NaN      34
# While what I want is:
Trading Day  Y_2021  Y_2023  Y_2024
2020-06-05       45      54      43
2020-06-06       34      23      34
Given your actual output and what you want, you should be able to just do
output.ffill().bfill().drop_duplicates()
to get the output you want.
Found the fix:
gt = gt.groupby('Trading Day').sum()
gt = gt.replace(0, np.nan)
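For reference, here is a minimal, self-contained sketch of what that fix does, using the made-up numbers from the example above:
import numpy as np
import pandas as pd

# staggered outer-merge result, one maturity value per row
gt = pd.DataFrame({
    'Trading Day': pd.to_datetime(['2020-06-05'] * 3 + ['2020-06-06'] * 3),
    'Y_2021': [45, np.nan, np.nan, 34, np.nan, np.nan],
    'Y_2023': [np.nan, 54, np.nan, np.nan, 23, np.nan],
    'Y_2024': [np.nan, np.nan, 43, np.nan, np.nan, 34],
})

# sum() skips NaN, so each trading day collapses to a single row; a column
# with no values at all in a group sums to 0, which replace() turns back into NaN
gt = gt.groupby('Trading Day').sum()
gt = gt.replace(0, np.nan)
print(gt)
One caveat: the replace step would also turn a genuine 0.0 value into NaN.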

Pandas df histogram: format my x ticks and include empty periods

I got this pandas df:
index                 TIME
12:07  2019-06-03 12:07:28
10:04  2019-06-04 10:04:25
11:14  2019-06-09 11:14:25
...
I use this command to draw a histogram of how many occurrences fall in each 15-minute period:
df['TIME'].groupby([df["TIME"].dt.hour, df["TIME"].dt.minute]).count().plot(kind="bar")
My plot looks like this:
How can I get x ticks like 10:15 instead of (10, 15), and how can I add the missing x ticks like 9:15, 9:30, ... to get a complete timeline?
You can resample your TIME column to 15-minute intervals and count the number of rows, then plot a regular bar chart.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker

df = pd.DataFrame({'TIME': pd.to_datetime('2019-01-01') + pd.to_timedelta(np.random.rand(100) * 3, unit='h')})
df = df[df.TIME.dt.minute > 15]  # make a gap
ax = df.resample('15T', on='TIME').count().plot.bar(rot=0)
ticklabels = [x.get_text()[-8:-3] for x in ax.get_xticklabels()]  # keep only HH:MM
ax.xaxis.set_major_formatter(matplotlib.ticker.FixedFormatter(ticklabels))
(for details about formatting datetime ticklabels of pandas bar plots see this SO question)
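An alternative sketch, reusing df and plt from the snippet above (my variation, not from the answer): format the resampled index as HH:MM strings before plotting, so the tick labels need no fixing afterwards.
# size() counts rows per 15-minute bin; resample() also inserts the empty
# bins, so the timeline is complete
counts = df.resample('15T', on='TIME').size()
counts.index = counts.index.strftime('%H:%M')  # label each bin as HH:MM
counts.plot.bar(rot=0)
plt.show()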

Calculate the min, max and mean windspeeds and standard deviations

Calculate the min, max and mean windspeeds and standard deviations of the windspeeds across all locations for each week (assume that the first week starts on January 2 1961) for the first 52 weeks.
Get the data:
https://github.com/prataplyf/Wind-DateTime/blob/master/wind_data.csv
I do not understand how to solve this. I want the weekly average of each location, like:
            RTP  VAL  ...  BEL  MAL
1961-01-01
1961-01-08
1961-01-15
Load the data:
import pandas as pd
df = pd.read_csv('wind_data.csv')
Convert the date to datetime and set it as the index:
df.date = pd.to_datetime(df.date)
df.set_index('date', drop=True, inplace=True)
Create a DataFrame for 1961:
df_1961 = df[df.index < pd.to_datetime('1962-01-01')]
Resample for the statistical calculations (assigning names, since the plots below use them):
df_1961_mean = df_1961.resample('W').mean()
df_1961_min = df_1961.resample('W').min()
df_1961_max = df_1961.resample('W').max()
df_1961_std = df_1961.resample('W').std()
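As a compact variation (not part of the original answer), all four statistics can be computed in one call with agg; note that the default 'W' anchor ends each weekly bin on a Sunday, which matches weeks that start on Monday, January 2 1961.
weekly_stats = df_1961.resample('W').agg(['min', 'max', 'mean', 'std'])
first_52 = weekly_stats.iloc[:52]  # the exercise asks for the first 52 weeks
print(first_52.head())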
Plot the data for 1961:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(12, 1, figsize=(15, 60), sharex=True)
for name, ax in zip(df_1961.columns, axes):
    ax.plot(df_1961[name], label='Daily')
    ax.plot(df_1961_mean[name], label='Weekly Mean Resample')
    ax.plot(df_1961_min[name], label='Weekly Min')
    ax.plot(df_1961_max[name], label='Weekly Max')
    ax.set_title(name)
    ax.legend()

plot score against timestamp in pandas

I have a dataframe in pandas:
date_hour   score
2019041822     -5
2019041823      0
2019041900      6
2019041901     -5
where date_hour is in YYYYMMDDHH format, and score is an int.
When I plot it, there is a long line connecting 2019041823 to 2019041900, treating all the values in between as absent (i.e. there is no score for 2019041824-2019041899, because no time maps to that range).
Is there a way for these gaps/absent values to be ignored, so that the line is continuous? Some of my data misses 2 days, so I get a long, misleading line.
The red circles show the gap between nights (i.e. between Apr 18 2300 and Apr 19 0000).
I used:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
x = gpb['date_hour']  # gpb is the dataframe above
y = gpb['score']
ax.plot(x, y, '.-')
display(fig)
I believe it is because date_hour is an int. I tried to convert it to str, but was met with an error: ValueError: x and y must have same first dimension.
Is there a way to plot so there are no gaps?
Try converting date_hour to a timestamp before plotting: df.date_hour = pd.to_datetime(df.date_hour, format='%Y%m%d%H').
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'date_hour': [2019041822, 2019041823, 2019041900, 2019041901],
                   'score': [-5, 0, 6, -5]})
df.date_hour = pd.to_datetime(df.date_hour, format='%Y%m%d%H')
df.plot(x='date_hour', y='score')
plt.show()
Output:
If you don't want to change your data, you can do:
df = pd.DataFrame({'date_hour': [2019041822, 2019041823, 2019041900, 2019041901],
                   'score': [-5, 0, 6, -5]})
plt.plot(pd.to_datetime(df.date_hour, format='%Y%m%d%H'), df.score)
which gives:

How could I download a dataframe after performing some calculations on it, with the new result?

Link: https://gist.github.com/dishantrathi/541db1a19a8feaf114723672d998b857
The input was a set of dates ranging from 2012 to 2015, and I needed to count the number of times each date repeated.
After counting, I have a dataset of dates with the unique count of each date, and now I have to download the unique counts with their corresponding dates in ascending order.
The output file should be a csv.
I believe you need reset_index to turn the Series returned by the groupby into a two-column DataFrame, then sort with sort_values:
df1 = df.groupby('Date').size().reset_index(name='count').sort_values('count')
Another solution with value_counts:
df1 = (df['Date'].value_counts()
                 .rename_axis('Date')
                 .reset_index(name='count')
                 .sort_values('count'))
print(df1.head())
           Date  count
66   02-05-2014     54
594  13-05-2014     56
294  07-02-2014     57
877  19-04-2013     58
162  04-05-2014     59
df1.to_csv('file.csv', index=False)
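If "in Ascending Order" means ascending by date rather than by count, a small variation (assuming the dates above are day-first strings, e.g. 02-05-2014 = 2 May 2014) is to parse the Date column and sort on it before writing:
df1['Date'] = pd.to_datetime(df1['Date'], dayfirst=True)  # parse day-first date strings
df1 = df1.sort_values('Date')  # chronological order
df1.to_csv('file.csv', index=False)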