How to set xticks for the index of string with hvplot - pandas

I have a dataframe region_cumulative_df_sel as below:
Month-Day regions RAIN_PERCENTILE_25 RAIN_PERCENTILE_50 RAIN_PERCENTILE_75 RAIN_MEAN RAIN_MEDIAN
07-01 1 0.0611691028 0.2811064720 1.9487996101 1.4330813885 0.2873695195
07-02 1 0.0945720226 0.8130480051 4.5959815979 2.9420840740 1.0614821911
07-03 1 0.2845511734 1.1912839413 5.5803232193 3.7756001949 1.1988518238
07-04 1 0.3402922750 3.2274529934 7.4262523651 5.2195668221 3.2781836987
07-05 1 0.4680584669 5.2418060303 8.6639881134 6.9092760086 5.3968687057
07-06 1 2.4329853058 7.3453550339 10.8091869354 8.7898645401 7.5020875931
... ...
... ...
... ...
06-27 1 382.7809448242 440.1162109375 512.6233520508 466.4956665039 445.0971069336
06-28 1 383.8329162598 446.2222900391 513.2116699219 467.9851379395 451.1973266602
06-29 1 385.7786254883 449.5384826660 513.4027099609 469.5671691895 451.2281188965
06-30 1 386.7952270508 450.6524658203 514.0201416016 471.2863159180 451.2484741211
The "Month-Day" index holds strings, not datetimes, covering each day from the first to the last day of the year.
I need to use hvplot to develop an interactive plot.
region_cumulative_df_sel.hvplot(width=900)
The labels on the x-axis are hard to read. How can I change the xticks to show only the 1st of each month, e.g. "07-01", "08-01", "09-01", ..., "06-01"?
I tried @Redox's code as below:
region_cumulative_df_sel['Month-Day'] = pd.to_datetime(region_cumulative_df_sel['Month-Day'],format="%m-%d") ##Convert to datetime
from bokeh.models.formatters import DatetimeTickFormatter
## Set format for showing x-axis ... you only need days, but in case counts change
formatter = DatetimeTickFormatter(days=["%m-%d"], months=["%m-%d"], years=["%m-%d"])
region_cumulative_df_sel.plot(x='Month-Day', xformatter=formatter, y=['RAIN_PERCENTILE_25','RAIN_PERCENTILE_50','RAIN_PERCENTILE_75','RAIN_MEAN','RAIN_MEDIAN'], width=900, ylabel="Rainfall (mm)",
rot=90, title="Cumulative Rainfall")
This is what I have generated.
How can I shift the xticks on the x-axis to align with the Month-Day values? Also, the popup window shows "1900" as the year for the Month-Day column. Can the year segment be removed?

The x-axis data is in string format, so holoviews treats it as categorical and plots every row as its own tick. You need to convert it to datetime, which allows the axis to be formatted the way you need. Here is a simple example showing how to do this; it should work in your case as well.
##My month-day column is string - 07-01 07-02 07-03 07-04 ... 12-31
df['Month-Day']=pd.to_datetime(df['Month-Day'],format="%m-%d") ##Convert to datetime
df['myY']=np.random.randint(100, size=(len(df))) ##Random Y data
from bokeh.models.formatters import DatetimeTickFormatter
## Set format for showing x-axis ... you only need days, but in case counts change
formatter = DatetimeTickFormatter(days=["%m-%d"], months=["%m-%d"], years=["%m-%d"])
##Plot graph
df.plot(x='Month-Day',xformatter=formatter)#.opts(xticks=4, xrotation=90)
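As a quick check of what this conversion produces, here is a minimal sketch on a few made-up month-day strings (the `month_day` values are an assumption, standing in for a year that runs July to June). It only shows how pandas fills in the missing year; it does not plot anything.

```python
import pandas as pd

# Hypothetical month-day strings for a year that runs July to June.
month_day = pd.Series(["07-01", "12-31", "01-01", "06-30"])

parsed = pd.to_datetime(month_day, format="%m-%d")

# With no year in the format string, every value lands in 1900.
years = parsed.dt.year.unique().tolist()
print(years)

# January and June now sort before July, so the rows are no longer
# in chronological order along a datetime axis.
print(parsed.is_monotonic_increasing)
```

This is likely why a hover tooltip would report 1900, and why data that starts in July may not line up as expected on the axis.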

@Redox is on the right track here. The issue is with the way the Month-Day column is converted to a datetime: pandas assumes the year is 1900 for every row.
Essentially you need to attach a year to the Month-Day in some way.
See the example below: it takes the first month-day string, prepends "2021-", and generates sequential daily values for every row (there are a few other ways of doing this).
code:
import pandas as pd
import numpy as np
import hvplot.pandas
from bokeh.models.formatters import DatetimeTickFormatter
dates = pd.date_range("2021-07-01", "2022-06-30", freq="D")
df = pd.DataFrame({
"md": dates.strftime("%m-%d"),
"ign": np.cumsum(np.random.normal(10, 5, len(dates))),
"sup": np.cumsum(np.random.normal(20, 10, len(dates))),
"imp": np.cumsum(np.random.normal(30, 15, len(dates))),
})
df["time"] = pd.date_range("2021-" + df.md[0], periods=len(df.index), freq="D")
formatter = DatetimeTickFormatter(
days=["%m-%d"], months=["%m-%d"], years=["%m-%d"])
df.hvplot(x='time', xformatter=formatter, y=['ign', 'sup', 'imp'],
width=900, ylabel="Index", rot=90, title="Cumulative ISI")
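One of the other ways, sketched here under assumptions: instead of relying on the rows being consecutive days, attach a year to each string based on its month. The helper name `month_day_to_dates` and the parameters `start_year` and `first_month` are made up for illustration. Note that a "02-29" string would fail to parse unless the year it is mapped to is a leap year.

```python
import pandas as pd

def month_day_to_dates(month_day, start_year=2021, first_month=7):
    """Map "MM-DD" strings to dates in a year that starts in `first_month`."""
    month_day = pd.Series(month_day).astype(str)
    months = month_day.str.slice(0, 2).astype(int)
    # Months before the starting month belong to the following calendar year.
    years = start_year + (months < first_month).astype(int)
    return pd.to_datetime(years.astype(str) + "-" + month_day, format="%Y-%m-%d")

dates = month_day_to_dates(["07-01", "12-31", "01-01", "06-30"])
print(dates.tolist())

# Candidate tick positions: the first day of each month in the range.
month_starts = pd.date_range(dates.min(), dates.max(), freq="MS")
print(month_starts.strftime("%m-%d").tolist())
```

The `month_starts` values are only candidate tick positions; how they are handed to the plotting library depends on the options it accepts for explicit ticks.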

Related

Slicing by date, using a variable start date

I am trying to slice according to a date column (which is calculated from the index), and to cumulatively sum only from the Start Date beside it.
Here is a small sample code to copy/run:
import numpy
import pandas
data = pandas.DataFrame(
{"Bought" : [1,3,4,6]}, index=pandas.to_datetime(['01-01-2020','02-01-2020','03-01-2020','04-01-2020']))
data['StartDate'] = data.index
data['Cum bought2'] = data.loc[data['StartDate']:]['Bought'].cumsum()
It gives me the error "cannot do slice indexing on DatetimeIndex with these indexers".
If I change the data.loc[data['StartDate']:] to a set value (i.e. '02-01-2020'), then it works fine. But I want the start date to be variable and taken from another column.
Edit1: new example. This is close, but the 3rd row shouldn't calculate a value since the Start Date hasn't been reached yet.
import numpy
import pandas
data = pandas.DataFrame(
{"Bought" : [1,3,4,6]}, index=pandas.to_datetime(['01-01-2020','02-01-2020','03-01-2020','04-01-2020']))
data['StartDate'] = ['02-01-2020','02-01-2020','04-01-2020','04-01-2020']
data['Cum Bought'] = data.loc[data['StartDate'].iloc[0]:]['Bought'].cumsum()
Edit2: Also, any idea how to resolve if there are pandas.NaT in the Start Date? I don't want to delete those rows completely, just treat them as zero in calculations.
import numpy
import pandas
data = pandas.DataFrame(
{"Bought" : [1,3,4,6]}, index=pandas.to_datetime(['01-01-2020','02-01-2020','03-01-2020','04-01-2020']))
data['StartDate'] = [pandas.NaT,'02-01-2020','04-01-2020','04-01-2020']
data['Cum Bought'] = data.loc[data['StartDate'].iloc[0]:]['Bought'].cumsum()
You're trying to index with a Series as the bound of a slice, which doesn't make sense. You need a single value. data.loc[data['StartDate'].iloc[0]:] or data.loc[data['StartDate'].min():] would work.
In your case, you should probably just use:
data['Cum bought2'] = data['Bought'].cumsum()
Or if you're not sure that the dates are sorted:
data['Cum bought2'] = data['Bought'].sort_index().cumsum()
Output:
Bought StartDate Cum bought2
2020-01-01 1 2020-01-01 1
2020-02-01 3 2020-02-01 4
2020-03-01 4 2020-03-01 8
2020-04-01 6 2020-04-01 14
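For the Edit1/Edit2 variants, one possible reading is that each row should sum 'Bought' from its own StartDate up to the row's date, and count as zero when the StartDate is missing or not yet reached. A minimal sketch under that assumption (the helper name `cum_bought_since_start` is made up, and the index is assumed to be sorted):

```python
import pandas as pd

fmt = "%m-%d-%Y"
data = pd.DataFrame(
    {"Bought": [1, 3, 4, 6]},
    index=pd.to_datetime(
        ["01-01-2020", "02-01-2020", "03-01-2020", "04-01-2020"], format=fmt
    ),
)
data["StartDate"] = pd.to_datetime(
    [pd.NaT, "02-01-2020", "04-01-2020", "04-01-2020"], format=fmt
)

def cum_bought_since_start(row):
    start = row["StartDate"]
    # Missing start date, or start date later than this row: treat as zero.
    if pd.isna(start) or start > row.name:
        return 0
    # Label-based slice on the sorted DatetimeIndex, inclusive on both ends.
    return data.loc[start:row.name, "Bought"].sum()

data["Cum Bought"] = data.apply(cum_bought_since_start, axis=1)
print(data)
```

A row-wise apply is slow on large frames, so this is meant to pin down the logic rather than to be the fastest approach.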

How can I always choose the last column in a csv table that's updated monthly?

Automating small business reporting from my Quickbooks P&L. I'm trying to get the net income value for the current month from a specific cell in a dataframe, but that cell moves one column to the right every month when I update the csv file.
For example, for the code below, this month I want the value from Nov[0], but next month I'll want the value from Dec[0], even though that column doesn't exist yet.
Is there a graceful way to always select the second right most column, or is this a stupid way to try and get this information?
import numpy as np
import pandas as pd
nov = -810
dec = 14958
total = 8693
d = {'Jan': [50], 'Feb': [70], 'Total':[120]}
df = pd.DataFrame(data=d)
Sure, you can reference the last or second-to-last row or column.
d = {'Jan': [50], 'Feb': [70], 'Total':[120]}
df = pd.DataFrame(data=d)
x = df.iloc[-1,-2]
This will select the value in the last row for the second-to-last column, in this case 70. :)
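A small sketch of the same idea that also reports which column was picked, and checks that the selection follows along when a new month is added before 'Total' (the 'Mar' column and its value are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Jan": [50], "Feb": [70], "Total": [120]})

# Name and value of the second-to-last column, whatever it is called.
latest_month = df.columns[-2]
latest_value = df.iloc[-1, -2]
print(latest_month, latest_value)

# Next month a new column appears just before 'Total'.
df.insert(len(df.columns) - 1, "Mar", [90])
next_month = df.columns[-2]
next_value = df.iloc[-1, -2]
print(next_month, next_value)
```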
If you plan to use the full file, @VincentRupp's answer will get you what you want.
But if you only plan to use the values in the second right most column and you can infer what it will be called, you can tell pd.read_csv that's all you want.
import pandas as pd # 1.5.1
# assuming we want this month's name
# can modify to use some other month
abbreviated_month_name = pd.to_datetime("today").strftime("%b")
df = pd.read_csv("path/to/file.csv", usecols=[abbreviated_month_name])
print(df.iloc[-1, 0])
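If the column name cannot be inferred from today's date, a possible variation is to read only the header first and take the second-to-last name from it. This is a sketch: `csv_text` is a made-up stand-in for the real file, and `io.StringIO` is used only so the example runs without a file on disk.

```python
import io
import pandas as pd

# Hypothetical file contents standing in for the exported P&L csv.
csv_text = "Jan,Feb,Total\n50,70,120\n"

# nrows=0 parses the header only, so no data rows are loaded.
header = pd.read_csv(io.StringIO(csv_text), nrows=0)
month_col = header.columns[-2]

# Second pass: load just that one column.
df = pd.read_csv(io.StringIO(csv_text), usecols=[month_col])
value = df.iloc[-1, 0]
print(month_col, value)
```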

Interpolate values based in date in pandas

I have the following datasets
import pandas as pd
import numpy as np
df = pd.read_excel("https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
sheet_name="Sheet1")
df2 = pd.read_excel("https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
sheet_name="Sheet2")
df2.dropna(inplace = True)
For each pair of X-Axis Value and Y-Axis Value columns in the first df, where the first is a date and the second is a value, I would like to create rows with the same date. For instance, in df.iloc[0,0] the timestamp is Timestamp('2020-08-25 23:14:12'). However, the following columns of the same row may hold other dates with different Y-Axis Values associated. The first one in that row is X-Axis Value NCVE-064 HPNDE, with a timestamp of 2020-08-25 23:04:12 and an associated Y-Axis Value of 0.952.
What I want to accomplish is to interpolate those values for a time interval, maybe 10 minutes, and then merge those results to have the same date for each row.
For df2 it is more or less the same: interpolate the values over a time interval and add them to the original dataframe. Is there any way to do this?
The trick is to realize that datetimes can be represented as seconds elapsed since some reference time.
Without further context, the hardest part is deciding at which times you want to have the interpolated values.
import pandas as pd
import numpy as np
from scipy.interpolate import interp1d
df = pd.read_excel(
"https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
sheet_name="Sheet1",
)
x_columns = [col for col in df.columns if col.startswith("X-Axis")]
# What times do we want to align the columns to?
# You can use anything else here or define equally spaced time points
# or something else.
target_times = df[x_columns].min(axis=1)
def interpolate_column(target_times, x_times, y_values):
    ref_time = x_times.min()
    # For interpolation we need to represent the values as floats. One option is to
    # compute the delta in seconds between a reference time and the "current" time.
    deltas = (x_times - ref_time).dt.total_seconds()
    # repeat for our target times
    target_times_seconds = (target_times - ref_time).dt.total_seconds()
    return interp1d(deltas, y_values, bounds_error=False, fill_value="extrapolate")(target_times_seconds)
output_df = pd.DataFrame()
output_df["Times"] = target_times
output_df["Y-Axis Value NCVE-063 VPNDE"] = interpolate_column(
target_times,
df["X-Axis Value NCVE-063 VPNDE"],
df["Y-Axis Value NCVE-063 VPNDE"],
)
# repeat for the other columns, better in a loop
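If equally spaced target times are preferred (the question mentions 10 minutes), the same seconds-since-reference idea can be sketched with `np.interp` on a regular grid. The `times` and `values` below are made-up samples standing in for one X-Axis/Y-Axis column pair, not data from the spreadsheet.

```python
import numpy as np
import pandas as pd

# Hypothetical irregular samples for a single column pair.
times = pd.to_datetime(
    ["2020-08-25 23:04:12", "2020-08-25 23:14:12", "2020-08-25 23:44:12"]
)
values = np.array([0.0, 1.0, 4.0])

# Regular 10-minute grid spanning the observed range.
grid = pd.date_range(times.min(), times.max(), freq="10min")

# Express both sets of times as seconds since the first sample.
ref_time = times.min()
sample_seconds = (times - ref_time).total_seconds()
grid_seconds = (grid - ref_time).total_seconds()

# Linear interpolation; sample_seconds must be increasing for np.interp.
aligned = pd.DataFrame(
    {"Times": grid, "value": np.interp(grid_seconds, sample_seconds, values)}
)
print(aligned)
```

Unlike `interp1d` with `fill_value="extrapolate"`, `np.interp` clamps to the end values outside the sampled range, which is a design choice to be aware of when the grid extends past the data.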

How to categorize a range of hours in Pandas?

In my project I am trying to create a new column that categorizes records by hour. I have a column in the dataframe called 'TowedTime' with time data, and I want another column that categorizes each record by the full hour, without minutes. For example, if the value in the 'TowedTime' column is 09:32:10 it should be categorized as 9 AM, if it is 12:45:10 it should be categorized as 12 PM, and so on for all the other values. I've read about the .cut function and bins, but I can't get the result I want.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
df = pd.read_excel("Baltimore Towing Division.xlsx",sheet_name="TowingData")
df['Month'] = pd.DatetimeIndex(df['TowedDate']).strftime("%b")
df['Week day'] = pd.DatetimeIndex(df['TowedDate']).strftime("%a")
monthOrder = ['Jan', 'Feb', 'Mar', 'Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
dayOrder = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
pivotHours = pd.pivot_table(df, values='TowedDate',index='TowedTime',
columns='Week day',
fill_value=0,
aggfunc= 'count',
margins = False, margins_name='Total').reindex(dayOrder,axis=1)
print(pivotHours)
First, make sure the type of the column 'TowedTime' is datetime. Second, you can easily extract the hour from this data type.
df['TowedTime'] = pd.to_datetime(df['TowedTime'],format='%H:%M:%S')
df['hour'] = df['TowedTime'].dt.hour
hope it answers your question
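To get labels like "9 AM" and "12 PM" rather than the bare hour number, one option is to format the parsed times with `strftime`. This is a sketch on made-up sample times; the 'HourLabel' column name is an assumption, and the AM/PM text produced by `%p` depends on the locale.

```python
import pandas as pd

# Hypothetical sample values in the same HH:MM:SS form as 'TowedTime'.
df = pd.DataFrame({"TowedTime": ["09:32:10", "12:45:10", "00:05:00", "17:59:59"]})

parsed = pd.to_datetime(df["TowedTime"], format="%H:%M:%S")

# Integer hour (0-23), handy for sorting and pivoting.
df["Hour"] = parsed.dt.hour

# 12-hour clock label; %I is zero-padded, so strip the leading zero.
df["HourLabel"] = parsed.dt.strftime("%I %p").str.lstrip("0")
print(df)
```

Keeping the integer 'Hour' column next to the label helps because the text labels do not sort chronologically.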
With the help of @Fabien C I was able to solve the problem.
First, I checked the data type of the values in the 'TowedTime' column with dtypes and found that they were of type object.
I then tried converting 'TowedTime' to datetime:
df['TowedTime'] = pd.to_datetime(df['TowedTime'],format='%H:%M:%S').dt.time
Then to create a new column in the df, for only the hours:
df['Hour'] = pd.to_datetime(df['TowedTime'],format='%H:%M:%S').dt.hour
And the result was this:
You can see in the image that the 'TowedTime' column remains an object, but the new 'Hour' column correctly returns the hour value.
Originally, the dataset already had the date and time separated into different columns. I think some method was used in Excel to separate date and time, and this made the time ('TowedTime') an object. I could not convert it, or at least that's what dtypes shows me.
I tried all of these pandas methods for converting the object to datetime:
df['TowedTime'] = pd.to_datetime(df['TowedTime'])
df['TowedTime'] = df['TowedTime'].astype('datetime64[ns]')
df['TowedTime'] = pd.to_datetime(df['TowedTime'], format='%H:%M:%S')
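A possible explanation for the object dtype, sketched on made-up values: a column of plain `datetime.time` objects (which is what `.dt.time` produces, and assumed here to be what the Excel import gives) is stored as object because the default NumPy-backed pandas dtypes have no time-of-day type. The hour can still be read directly from each object.

```python
import datetime
import pandas as pd

# Hypothetical column of time-of-day objects.
df = pd.DataFrame(
    {"TowedTime": [datetime.time(9, 32, 10), datetime.time(12, 45, 10)]}
)

# Time-only values are kept as generic Python objects.
print(df["TowedTime"].dtype)

# The hour is available on each datetime.time object.
df["Hour"] = df["TowedTime"].map(lambda t: t.hour)
print(df)
```

So an object dtype here does not necessarily mean the conversion failed; it may just reflect that only the time part was kept.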

How to group pandas timestamps, plot several plots in one figure, and stack them together in matplotlib?

I have a data frame with perfectly organised timestamps, like below:
It's a web log, and the timestamps go through the whole year. I want to cut them into days, show the visits within each hour, and plot them all in the same figure, stacked together, just like the picture shown below:
I am doing well at cutting them into days and plotting the visits of a single day, but I am having trouble plotting the days together and stacking them. The primary tools I am using are Pandas and Matplotlib.
Any advice or suggestions? Much appreciated!
Edited:
My Code is as below:
The timestamps are: https://gist.github.com/adamleo/04e4147cc6614820466f7bc05e088ac5
And the dataframe looks like this:
I plotted the timestamp density over the whole period using the code below:
timestamps_series_all = pd.DatetimeIndex(pd.Series(unique_visitors_df.time_stamp))
timestamps_series_all_toBePlotted = pd.Series(1, index=timestamps_series_all)
timestamps_series_all_toBePlotted.resample('D').sum().plot()
and got the result:
I plotted timestamps within one day using the code:
timestamps_series_oneDay = pd.DatetimeIndex(pd.Series(unique_visitors_df.time_stamp.loc[unique_visitors_df["date"] == "2014-08-01"]))
timestamps_series_oneDay_toBePlotted = pd.Series(1, index=timestamps_series_oneDay)
timestamps_series_oneDay_toBePlotted.resample('H').sum().plot()
and the result:
And now I am stuck.
I'd really appreciate all of your help!
I think you need pivot:
#load the timestamps from https://gist.github.com/adamleo/04e4147cc6614820466f7bc05e088ac5 into a list L
df = pd.DataFrame({'date':L})
print (df.head())
date
0 2014-08-01 00:05:46
1 2014-08-01 00:14:47
2 2014-08-01 00:16:05
3 2014-08-01 00:20:46
4 2014-08-01 00:23:22
#convert to datetime if necessary
df['date'] = pd.to_datetime(df['date'] )
#resample by Hours, get count and create df
df = df.resample('H', on='date').size().to_frame('count')
#extract date and hour
df['days'] = df.index.date
df['hours'] = df.index.hour
#pivot and plot
#maybe check parameter kind='density' from http://stackoverflow.com/a/33474410/2901002
#df.pivot(index='days', columns='hours', values='count').plot(rot='90')
#edit: last line change to below:
df.pivot(index='hours', columns='days', values='count').plot(rot=90)
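To see the shape of the table being plotted, here is a self-contained sketch on a handful of made-up timestamps. It swaps the resample-then-pivot steps for a `groupby` on hour and date, which is an alternative route to the same hours-by-days layout rather than the method used above.

```python
import pandas as pd

# Hypothetical timestamps standing in for the web log.
stamps = pd.to_datetime([
    "2014-08-01 00:05:46", "2014-08-01 00:14:47", "2014-08-01 01:20:46",
    "2014-08-02 00:23:22", "2014-08-02 01:02:03", "2014-08-02 01:30:00",
])
df = pd.DataFrame({"date": stamps})

# Count visits per (hour, day), then spread the days into columns.
counts = (
    df.groupby([df["date"].dt.hour.rename("hours"),
                df["date"].dt.date.rename("days")])
      .size()
      .unstack("days", fill_value=0)
)
print(counts)
```

Calling `counts.plot()` would then draw one line per day against the hour of day, assuming matplotlib is installed.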