How to group pandas timestamps, plot several plots in one figure, and stack them together in matplotlib?

I have a data frame with perfectly organised timestamps, like below:
It's a web log, and the timestamps go through the whole year. I want to split them by day, show the visits within each hour, plot every day in the same figure, and stack them all together, just like the picture shown below:
I have managed to split them into days and plot the visits of a single day, but I am having trouble plotting the days together and stacking them. The primary tools I am using are Pandas and Matplotlib.
Any advice or suggestions? Much appreciated!
Edited:
My Code is as below:
The timestamps are: https://gist.github.com/adamleo/04e4147cc6614820466f7bc05e088ac5
And the dataframe looks like this:
I plotted the timestamp density through the whole period using the code below:
timestamps_series_all = pd.DatetimeIndex(pd.Series(unique_visitors_df.time_stamp))
timestamps_series_all_toBePlotted = pd.Series(1, index=timestamps_series_all)
timestamps_series_all_toBePlotted.resample('D').sum().plot()
and got the result:
I plotted timestamps within one day using the code:
timestamps_series_oneDay = pd.DatetimeIndex(pd.Series(unique_visitors_df.time_stamp.loc[unique_visitors_df["date"] == "2014-08-01"]))
timestamps_series_oneDay_toBePlotted = pd.Series(1, index=timestamps_series_oneDay)
timestamps_series_oneDay_toBePlotted.resample('H').sum().plot()
and the result:
And now I am stuck.
I'd really appreciate all of your help!

I think you need pivot:
# L is the list of timestamps from https://gist.github.com/adamleo/04e4147cc6614820466f7bc05e088ac5
df = pd.DataFrame({'date':L})
print (df.head())
date
0 2014-08-01 00:05:46
1 2014-08-01 00:14:47
2 2014-08-01 00:16:05
3 2014-08-01 00:20:46
4 2014-08-01 00:23:22
#convert to datetime if necessary
df['date'] = pd.to_datetime(df['date'] )
#resample by Hours, get count and create df
df = df.resample('H', on='date').size().to_frame('count')
#extract date and hour
df['days'] = df.index.date
df['hours'] = df.index.hour
#pivot and plot
#maybe check parameter kind='density' from http://stackoverflow.com/a/33474410/2901002
#df.pivot(index='days', columns='hours', values='count').plot(rot=90)
#edit: last line changed to the one below:
df.pivot(index='hours', columns='days', values='count').plot(rot=90)
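To make the pivot step concrete, here is a minimal sketch on synthetic data. The timestamps are made up (they stand in for the gist), and `pd.offsets.Hour()` is used in place of the `'H'` string only because recent pandas releases deprecate the uppercase alias; otherwise it is the same resample-then-pivot idea as above.

```python
import pandas as pd

# Made-up visits: on day d (0, 1, 2), hour h gets (h % 5) + d + 1 visits
days = pd.date_range("2014-08-01", periods=3)
stamps = [
    day + pd.Timedelta(hours=h, minutes=m)
    for d, day in enumerate(days)
    for h in range(24)
    for m in range((h % 5) + d + 1)
]
df = pd.DataFrame({"date": stamps})

# Count visits per hour over the whole period
hourly = df.resample(pd.offsets.Hour(), on="date").size().to_frame("count")

# Split the hourly index into a day part and an hour-of-day part
hourly["days"] = hourly.index.date
hourly["hours"] = hourly.index.hour

# One row per hour of day, one column per calendar day
pivot = hourly.pivot(index="hours", columns="days", values="count")
print(pivot.head())
```

Calling `pivot.plot(rot=90)` on this table draws one line per day over the 24 hours; `pivot.plot(subplots=True)` may be closer to what is wanted if each day should sit in its own stacked panel.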

Related

How to set xticks for the index of string with hvplot

I have a dataframe region_cumulative_df_sel as below:
Month-Day regions RAIN_PERCENTILE_25 RAIN_PERCENTILE_50 RAIN_PERCENTILE_75 RAIN_MEAN RAIN_MEDIAN
07-01 1 0.0611691028 0.2811064720 1.9487996101 1.4330813885 0.2873695195
07-02 1 0.0945720226 0.8130480051 4.5959815979 2.9420840740 1.0614821911
07-03 1 0.2845511734 1.1912839413 5.5803232193 3.7756001949 1.1988518238
07-04 1 0.3402922750 3.2274529934 7.4262523651 5.2195668221 3.2781836987
07-05 1 0.4680584669 5.2418060303 8.6639881134 6.9092760086 5.3968687057
07-06 1 2.4329853058 7.3453550339 10.8091869354 8.7898645401 7.5020875931
... ...
... ...
... ...
06-27 1 382.7809448242 440.1162109375 512.6233520508 466.4956665039 445.0971069336
06-28 1 383.8329162598 446.2222900391 513.2116699219 467.9851379395 451.1973266602
06-29 1 385.7786254883 449.5384826660 513.4027099609 469.5671691895 451.2281188965
06-30 1 386.7952270508 450.6524658203 514.0201416016 471.2863159180 451.2484741211
The index "Month-Day" is of type string rather than datetime, and runs from the first day to the last day of a year.
I need to use hvplot to develop an interactive plot.
region_cumulative_df_sel.hvplot(width=900)
It is hard to read the labels on the x axis. How can I change the xticks to show only the 1st of each month, e.g. "07-01", "08-01", "09-01", ... ..., "06-01"?
I tried @Redox's code as below:
region_cumulative_df_sel['Month-Day'] = pd.to_datetime(region_cumulative_df_sel['Month-Day'],format="%m-%d") ##Convert to datetime
from bokeh.models.formatters import DatetimeTickFormatter
## Set format for showing x-axis ... you only need days, but in case counts change
formatter = DatetimeTickFormatter(days=["%m-%d"], months=["%m-%d"], years=["%m-%d"])
region_cumulative_df_sel.plot(x='Month-Day', xformatter=formatter, y=['RAIN_PERCENTILE_25','RAIN_PERCENTILE_50','RAIN_PERCENTILE_75','RAIN_MEAN','RAIN_MEDIAN'], width=900, ylabel="Rainfall (mm)",
rot=90, title="Cumulative Rainfall")
This is what I have generated.
How can I shift the xticks on the x-axis to align with the Month-Day values? Also, the popup window shows "1900" as the year for the Month-Day column. Can the year segment be removed?
The x-axis data is in string format, so holoviews treats it as categorical and plots a tick for every row. You need to convert it to datetime, which will allow the plot to use the format you need. I am taking a simple example and showing how to do this... it should work in your case as well...
##My month-day column is string - 07-01 07-02 07-03 07-04 ... 12-31
df['Month-Day']=pd.to_datetime(df['Month-Day'],format="%m-%d") ##Convert to datetime
df['myY']=np.random.randint(100, size=(len(df))) ##Random Y data
from bokeh.models.formatters import DatetimeTickFormatter
## Set format for showing x-axis ... you only need days, but in case counts change
formatter = DatetimeTickFormatter(days=["%m-%d"], months=["%m-%d"], years=["%m-%d"])
##Plot graph
df.plot(x='Month-Day',xformatter=formatter)#.opts(xticks=4, xrotation=90)
@Redox is on the right track here. The issue is with the way the Month-Day column is converted to a datetime; pandas assumes the year is 1900 for every row.
Essentially you need to attach a year to the Month-Day in some way.
See the example below: it takes the first month-day string, prepends "2021-" and generates sequential daily values for every row (but there are a few ways of doing this).
code:
import pandas as pd
import numpy as np
import hvplot.pandas
from bokeh.models.formatters import DatetimeTickFormatter
dates = pd.date_range("2021-07-01", "2022-06-30", freq="D")
df = pd.DataFrame({
"md": dates.strftime("%m-%d"),
"ign": np.cumsum(np.random.normal(10, 5, len(dates))),
"sup": np.cumsum(np.random.normal(20, 10, len(dates))),
"imp": np.cumsum(np.random.normal(30, 15, len(dates))),
})
df["time"] = pd.date_range("2021-" + df.md[0], periods=len(df.index), freq="D")
formatter = DatetimeTickFormatter(
days=["%m-%d"], months=["%m-%d"], years=["%m-%d"])
df.hvplot(x='time', xformatter=formatter, y=['ign', 'sup', 'imp'],
width=900, ylabel="Index", rot=90, title="Cumulative ISI")
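One of the other ways mentioned above, sketched under stated assumptions: instead of generating sequential dates, attach a year to each month-day string directly. The sample strings and the choice of 2021/2022 are made up for illustration; the only idea taken from the thread is that the season runs from July to June.

```python
import pandas as pd

# Made-up sample of month-day strings spanning a July-to-June season
md = pd.Series(["07-01", "12-31", "01-01", "06-30"], name="Month-Day")

# Assumption: the season starts in July 2021, so July-December rows get 2021
# and January-June rows get 2022
month = md.str[:2].astype(int)
year = month.map(lambda m: "2021" if m >= 7 else "2022")

# Build full dates so the axis is ordered July -> June instead of all in 1900
time = pd.to_datetime(year + "-" + md, format="%Y-%m-%d")
print(time.dt.strftime("%Y-%m-%d").tolist())
```

Unlike the sequential approach, this does not rely on the rows being complete and in order, but a "02-29" row would only parse if the year assigned to it is a leap year.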

Matplotlib y axis not displaying from low values to high values

I'm a newbie in Spyder, especially with matplotlib. Currently I am trying to display stock price data for the years 2014-2019.
The problem I have encountered is that the data is displayed incorrectly.
There are 2 flaws, along with workarounds I have attempted.
Is the display error caused by the first index being the year 2019 instead of the year 2014?
Flaw 1: plt.yticks([0,140]) does not even display ticks from 0 to 140.
Not displaying ticks from 0 to 140
Flaw 2: If I remove the conversion of the Date to a datetime format, the display error is half rectified, but a new error appears because the x axis cannot be displayed correctly.
Data is presented correctly from descending to ascending order but the date cannot be displayed as it is not in datetime format.
In essence, the main flaw is that the ticks of the y axis cannot be displayed, and the data needs to be presented in descending to ascending order.
I've spent hours trying to fix this, my hair is falling out, I'm gonna be bald soon.
The data is read from year 2019 to 2014, as the file is organized so that the first entry is 2019 and the last entry is 2014.
I understand that the data is cramped during display, but even changing the view to automatic instead of inline does not solve this issue.
The code is as below
import pandas as pd
import matplotlib.pyplot as plt

file = 'Microsoft.csv'
df = pd.read_csv(file)
## convert timestamp
# df['Date'] = pd.to_datetime(df['Date'], format = '%m/%d/%Y')
# df['Close'] = df['Close'].str.replace('$', '')
# df['Close'].astype('float')
##Not Necessary
# df['Close'].apply(lambda x: float(x))
# df.Close = float(df.Close)
# df['Close'] = df.Close.astype(float)
##Beginning of plot
plt.plot(df.Date, df.Close)
plt.yticks([0,140])
plt.ylim(0, 1511)
plt.suptitle('Stock Price')
plt.title('Microsoft', fontdict={'fontsize':15,'fontweight':'bold'})
plt.xlabel('Year')
plt.ylabel('Price in USD')
plt.show()
This is the original plotting without adding the yticks.
X-axis not converted to datetime, Y-Axis does not have yticks
Any help would be appreciated. The goal is to display the plot from descending to ascending order, with yticks displayed in any step such as 0,20,40,60,80,100,120,140 or any other amount.
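The commented-out lines suggest that Close is stored as text such as "$140.00". If that is the case, matplotlib treats the values as categories and plots them in file order, which would explain both symptoms. A minimal sketch under that assumption, with a made-up three-row stand-in for Microsoft.csv (the prices are invented):

```python
import io
import pandas as pd

# Made-up stand-in for Microsoft.csv: newest row first, prices as "$..." text
csv_text = """Date,Close
12/31/2019,$140.00
06/28/2017,$100.50
01/02/2014,$40.25
"""
df = pd.read_csv(io.StringIO(csv_text))

# Parse the dates and turn the price text into real numbers
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df["Close"] = df["Close"].str.replace("$", "", regex=False).astype(float)

# Oldest row first, so the line runs left to right in time
df = df.sort_values("Date").reset_index(drop=True)
print(df.dtypes)
```

After this, `plt.plot(df['Date'], df['Close'])` followed by `plt.yticks(range(0, 141, 20))` should give numeric ticks at 0, 20, ..., 140. Note that `plt.yticks([0, 140])` asks for exactly two ticks, and a later `plt.ylim(0, 1511)` stretches the axis far beyond them.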

Reshape Pandas dataframe (partial transpose)

I have a csv similar to the following, where the column heading specifies the time (hour number):
Day,Location,1,2,3
1/1/2021,A,0.26,0.25,0.49
1/1/2021,B,0.8,0.23,0.55
1/1/2021,C,0.32,0.11,0.58
1/2/2021,A,0.67,0.72,0.49
1/2/2021,B,0.25,0.09,0.56
1/2/2021,C,0.83,0.54,0.7
When I load it as a dataframe using
df = pd.read_csv(open('VirusLevels.csv', 'r'), index_col=[0,1], header=0)
Pandas creates a dataframe with indices Day and Location, and column names 1, 2, and 3.
I need it to be reshaped as shown below, where Day and Time are the indices, and the Location is the column heading:
I've tried a lot of things and followed a lot of rabbitholes, but haven't been successful. The most on-point example I could find suggested something like the following, but it doesn't work (says "KeyError: 'Day'").
df.melt(id_vars=['Day'], var_name= 'Time',
value_name = 'VirusLevels').sort_values(by='Location').reset_index(drop=True)
Thanks in advance for any help.
Try:
df = pd.read_csv('VirusLevels.csv', index_col=[0,1])
df.rename_axis(columns='Time').stack().unstack('Location')
# or
# df.rename_axis('Time',axis='columns').stack().unstack('Location')
Output:
Location A B C
Day Time
1/1/2021 1 0.345307 0.099403 0.474077
2 0.299947 0.853091 0.352472
3 0.400975 0.599249 0.743099
1/2/2021 1 0.660258 0.003976 0.295406
2 0.425434 0.953433 0.418783
3 0.421021 0.844761 0.369561
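For a result that matches the question's numbers, here is the same reshaping run on the six sample rows, read from an in-memory string instead of VirusLevels.csv (an assumption made only to keep the example self-contained). The KeyError in the melt attempt most likely comes from Day being part of the index rather than a column once index_col=[0,1] is used.

```python
import io
import pandas as pd

# The sample rows from the question
csv_text = """Day,Location,1,2,3
1/1/2021,A,0.26,0.25,0.49
1/1/2021,B,0.8,0.23,0.55
1/1/2021,C,0.32,0.11,0.58
1/2/2021,A,0.67,0.72,0.49
1/2/2021,B,0.25,0.09,0.56
1/2/2021,C,0.83,0.54,0.7
"""
df = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1])

# Name the column axis "Time", move it into the index (stack),
# then move Location out of the index into the columns (unstack)
reshaped = df.rename_axis(columns="Time").stack().unstack("Location")
print(reshaped)
```

The Time labels stay as the strings "1", "2", "3" because they come from the CSV header.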

How can I draw Yearly series using monthly data from a DateTimeIndex in Matplotlib?

I have monthly data of 6 variables from 2014 until 2018 in one dataset.
I'm trying to draw 6 subplots (one for each variable) with monthly X axis (Jan, Feb....) and 5 series (one for each year) with their legend.
This is part of the data:
I created 5 series (one for each year) per variable (30 in total) and I'm getting the expected output but using MANY lines of code.
What is the best way to achieve this using fewer lines of code?
This is an example how I created the series:
CL2014 = data_total['Charity Lottery'].where(data_total['Date'].dt.year == 2014)[0:12]
CL2015 = data_total['Charity Lottery'].where(data_total['Date'].dt.year == 2015)[12:24]
This is an example of how I'm plotting the series:
axCL.plot(xvals, CL2014)
axCL.plot(xvals, CL2015)
axCL.plot(xvals, CL2016)
axCL.plot(xvals, CL2017)
axCL.plot(xvals, CL2018)
There's no need to litter your namespace with 30 variables. Seaborn makes the job very easy but you need to normalize your dataframe first. This is what "normalized" or "unpivoted" looks like (Seaborn calls this "long form"):
Date variable value
2014-01-01 Charity Lottery ...
2014-01-01 Racecourse ...
2014-04-01 Bingo Halls ...
2014-04-01 Casino ...
Your screenshot is a "pivoted" or "wide form" dataframe.
df_plot = pd.melt(df, id_vars='Date')
df_plot['Year'] = df_plot['Date'].dt.year
df_plot['Month'] = df_plot['Date'].dt.strftime('%b')
import seaborn as sns
plot = sns.catplot(data=df_plot, x='Month', y='value',
row='Year', col='variable', kind='bar',
sharex=False)
plot.savefig('figure.png', dpi=300)
Result (all numbers are randomly generated):
I would try using .groupby(); it is really powerful for paring down things like this:
for _, group in data_total.groupby([year, month])[[x_variable, y_variable]]:
    plt.plot(group[x_variable], group[y_variable])
So here the groupby will separate your data_total DataFrame into year/month subsets, with the [[]] on the end to pare it down to the x_variable (assuming it is in your data_total DataFrame) and your y_variable, which you can make any of those features you are interested in.
I would decompose your datetime column into separate year and month columns, then use those new columns inside that groupby as the [year, month]. You might be able to pass in the dt.year and dt.month like you had before... not sure, try it both ways!
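Another compact option, sketched here with made-up numbers, is to pivot one variable so that months become rows and years become columns. This is a different technique from the groupby loop above, not a rendering of it. The column names 'Date' and 'Charity Lottery' come from the question; the values are synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data for 2014-2018 (60 months), values are placeholders
dates = pd.date_range("2014-01-01", periods=60, freq="MS")
data_total = pd.DataFrame({"Date": dates, "Charity Lottery": np.arange(60.0)})

# Months as rows, years as columns: one column per series to draw
wide = (
    data_total
    .assign(Year=data_total["Date"].dt.year, Month=data_total["Date"].dt.month)
    .pivot(index="Month", columns="Year", values="Charity Lottery")
)
print(wide.head())
```

With this table, `wide.plot(ax=axCL)` draws the five yearly lines with a legend in one call, and the same three lines can be repeated in a loop over the six variables.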

pandas df.resample('D').sum() returns NaN

I've got a pandas data frame with electricity meter readings(cumulative). The df DatetimeIndex dtype='datetime64[ns]'. When I load the .csv file the dataframe does not contain any NaN values. I need to calculate both the monthly and daily energy generated.
To calculate monthly generation I use dfmonth = df.resample('M').sum() . This works fine.
To calculate daily generation I thought of using dfday = df.resample('D').sum(). This partially works, but for some index dates (no data missing in the raw file) it returns NaN.
Please see the code below. Does anyone know why this happens? Any proposed solution?
df = pd.read_csv(file)
df = df.set_index(pd.DatetimeIndex(df['Reading Timestamp']))
df=df.rename(columns = {'Energy kWh':'meter', 'Instantaneous Power kW (approx)': 'kW'})
df.drop(df.columns[:10], axis=1, inplace=True) #Delete columns I don't need.
df['kWh'] = df['meter'].sub(df['meter'].shift())
dfmonth = df.resample('M').sum() #This works OK calculating kWh. dfmonth does not contain any NaN.
dfday = df.resample('D').sum() # This returns a total of 8 NaN out of 596 sampled points. Original df has 27929 DatetimeIndex rows
Thank you in advance.
A big apology to you all. The .csv I was given and the raw .csv I was checking against are not the same file. Data was somehow corrupted....
I've been banging my head against the wall till now; there is no problem with df.resample('D').sum()
Sorry again, consider thread sorted.
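For reference, a minimal sketch of the daily calculation on clean synthetic data (the half-hourly readings below are made up). Differencing a cumulative meter leaves a single NaN in the very first row, and sum() skips NaN by default, so the daily totals themselves contain no NaN.

```python
import numpy as np
import pandas as pd

# Made-up cumulative meter: one reading every 30 minutes for two days,
# rising by exactly 1 kWh per reading
idx = pd.Timestamp("2021-01-01") + pd.to_timedelta(np.arange(96) * 1800, unit="s")
df = pd.DataFrame({"meter": np.arange(96, dtype=float)}, index=idx)

# Energy generated between consecutive readings (first row is NaN)
df["kWh"] = df["meter"].diff()

# Daily totals; NaN values are skipped by sum()
dfday = df["kWh"].resample("D").sum()
print(dfday)
```

If NaN does show up in a real daily table, checking the parsed input for non-numeric cells is a reasonable first step, which fits the corrupted-file explanation above.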