In my project I am trying to create a new column that categorizes records by hour of the day. I have a column in the dataframe called 'TowedTime' with time data, and I want another column holding the full hour without minutes. For example, if the value in 'TowedTime' is 09:32:10 it should be categorized as 9 AM, if it is 12:45:10 it should be categorized as 12 PM, and so on for all the other values. I've read about pd.cut and bins, but I can't get the result I want.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
df = pd.read_excel("Baltimore Towing Division.xlsx",sheet_name="TowingData")
df['Month'] = pd.DatetimeIndex(df['TowedDate']).strftime("%b")
df['Week day'] = pd.DatetimeIndex(df['TowedDate']).strftime("%a")
monthOrder = ['Jan', 'Feb', 'Mar', 'Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
dayOrder = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
pivotHours = pd.pivot_table(df, values='TowedDate',index='TowedTime',
columns='Week day',
fill_value=0,
aggfunc= 'count',
margins = False, margins_name='Total').reindex(dayOrder,axis=1)
print(pivotHours)
First, make sure the type of the column 'TowedTime' is datetime. Second, you can easily extract the hour from this data type.
df['TowedTime'] = pd.to_datetime(df['TowedTime'],format='%H:%M:%S')
df['hour'] = df['TowedTime'].dt.hour
hope it answers your question
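The question also asks for labels such as 9 AM and 12 PM rather than a bare integer. A minimal sketch of one way to get there follows; the sample values and the helper name `hour_label` are made up for illustration and are not part of the original dataset or of pandas.

```python
import pandas as pd

# Hypothetical sample values standing in for the real 'TowedTime' column
df = pd.DataFrame({"TowedTime": ["09:32:10", "12:45:10", "00:05:00", "17:59:59"]})

# Parse the strings, then keep only the hour (0-23)
df["Hour"] = pd.to_datetime(df["TowedTime"], format="%H:%M:%S").dt.hour


def hour_label(hour: int) -> str:
    """Turn an hour in 0-23 into a 12-hour label such as '9 AM'."""
    suffix = "AM" if hour < 12 else "PM"
    # hour % 12 is 0 at midnight and noon, which should read as 12
    return f"{hour % 12 or 12} {suffix}"


df["HourLabel"] = df["Hour"].map(hour_label)
print(df)
```

Keeping the integer 'Hour' column alongside the label is handy for sorting, since the text labels do not sort chronologically.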
With the help of @Fabien C I was able to solve the problem.
First, I checked the data type of the values in the 'TowedTime' column with dtypes and found that it was object.
I then tried converting 'TowedTime' to datetime:
df['TowedTime'] = pd.to_datetime(df['TowedTime'],format='%H:%M:%S').dt.time
Then to create a new column in the df, for only the hours:
df['Hour'] = pd.to_datetime(df['TowedTime'],format='%H:%M:%S').dt.hour
And the result was this:
In the image you can see that the 'TowedTime' column remains an object, but the new 'Hour' column correctly returns the hour value.
Originally, the dataset already had the date and time separated into different columns. I think some method was used to split date and time in Excel, and this caused the time ('TowedTime') to be stored as an object. I could not convert it, or at least that is what dtypes shows me.
I tried all of these pandas methods for converting the object column to datetime:
df['TowedTime'] = pd.to_datetime(df['TowedTime'])
df['TowedTime'] = pd.to_datetime(df['TowedTime'])
df['TowedTime'] = df['TowedTime'].astype('datetime64[ns]')
df['TowedTime'] = pd.to_datetime(df['TowedTime'], format='%H:%M:%S')
df['TowedTime'] = pd.to_datetime(df['TowedTime'], format='%H:%M:%S')
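A possible explanation for the column still showing as object: pandas has no dedicated dtype for time-of-day values, so anything produced by `.dt.time` is a column of Python `datetime.time` objects and is reported as object. The sketch below uses made-up sample values to show this, and shows that the hour can still be read from such a column.

```python
import pandas as pd

# Hypothetical sample values standing in for the real 'TowedTime' column
raw = pd.Series(["09:32:10", "12:45:10"])

parsed = pd.to_datetime(raw, format="%H:%M:%S")
print(parsed.dtype)  # a datetime64 dtype

# .dt.time keeps only the time of day, as datetime.time objects
times = parsed.dt.time
print(times.dtype)  # object

# The hour is still available on each datetime.time object
hours = times.map(lambda t: t.hour)
print(hours.tolist())
```

So an object dtype here does not necessarily mean the conversion failed; it may only mean the column holds time-of-day values.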
Automating small business reporting from my Quickbooks P&L. I'm trying to get the net income value for the current month from a specific cell in a dataframe, but that cell moves one column to the right every month when I update the csv file.
For example, for the code below, this month I want the value from Nov[0], but next month I'll want the value from Dec[0], even though that column doesn't exist yet.
Is there a graceful way to always select the second right most column, or is this a stupid way to try and get this information?
import numpy as np
import pandas as pd
nov = -810
dec = 14958
total = 8693
d = {'Jan': [50], 'Feb': [70], 'Total':[120]}
df = pd.DataFrame(data=d)
Sure, you can reference the last or second-to-last row or column.
d = {'Jan': [50], 'Feb': [70], 'Total':[120]}
df = pd.DataFrame(data=d)
x = df.iloc[-1,-2]
This will select the value in the last row for the second-to-last column, in this case 70. :)
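To illustrate why this keeps working when a new month is added, here is a small sketch. The helper name `latest_month_value` and the 'Mar' figures are invented for the example; it assumes the 'Total' column is always last and the current month sits directly before it.

```python
import pandas as pd


def latest_month_value(frame: pd.DataFrame):
    """Return (column name, first-row value) of the second-to-last column."""
    name = frame.columns[-2]
    return name, frame[name].iloc[0]


this_month = pd.DataFrame({"Jan": [50], "Feb": [70], "Total": [120]})
# Hypothetical next export with one more month before 'Total'
next_month = pd.DataFrame({"Jan": [50], "Feb": [70], "Mar": [90], "Total": [210]})

print(latest_month_value(this_month))
print(latest_month_value(next_month))
```

Returning the column name alongside the value makes it easy to check that the expected month was picked up.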
If you plan to use the full file, @VincentRupp's answer will get you what you want.
But if you only plan to use the values in the second-rightmost column and you can infer what it will be called, you can tell pd.read_csv that's all you want.
import pandas as pd # 1.5.1
# assuming we want this month's name
# can modify to use some other month
abbreviated_month_name = pd.to_datetime("today").strftime("%b")
df = pd.read_csv("path/to/file.csv", usecols=[abbreviated_month_name])
print(df.iloc[-1, 0])
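If the column name cannot be inferred from today's date (for example when the file lags a month behind), one alternative is to read only the header first and then load just the second-to-last column. This is a sketch under the assumption that the layout matches the question; the in-memory CSV text stands in for the real file.

```python
import io

import pandas as pd

# Stand-in for the real CSV file; the layout is assumed from the question
csv_text = "Jan,Feb,Total\n50,70,120\n"

# nrows=0 reads the header only, which is cheap even for large files
header = pd.read_csv(io.StringIO(csv_text), nrows=0)
target = header.columns[-2]

# Second pass: load only the column we care about
df = pd.read_csv(io.StringIO(csv_text), usecols=[target])
value = df.iloc[-1, 0]
print(target, value)
```

With a file on disk, both `pd.read_csv` calls would take the same path instead of a `StringIO` object.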
I have a dataframe region_cumulative_df_sel as below:
Month-Day regions RAIN_PERCENTILE_25 RAIN_PERCENTILE_50 RAIN_PERCENTILE_75 RAIN_MEAN RAIN_MEDIAN
07-01 1 0.0611691028 0.2811064720 1.9487996101 1.4330813885 0.2873695195
07-02 1 0.0945720226 0.8130480051 4.5959815979 2.9420840740 1.0614821911
07-03 1 0.2845511734 1.1912839413 5.5803232193 3.7756001949 1.1988518238
07-04 1 0.3402922750 3.2274529934 7.4262523651 5.2195668221 3.2781836987
07-05 1 0.4680584669 5.2418060303 8.6639881134 6.9092760086 5.3968687057
07-06 1 2.4329853058 7.3453550339 10.8091869354 8.7898645401 7.5020875931
... ...
... ...
... ...
06-27 1 382.7809448242 440.1162109375 512.6233520508 466.4956665039 445.0971069336
06-28 1 383.8329162598 446.2222900391 513.2116699219 467.9851379395 451.1973266602
06-29 1 385.7786254883 449.5384826660 513.4027099609 469.5671691895 451.2281188965
06-30 1 386.7952270508 450.6524658203 514.0201416016 471.2863159180 451.2484741211
The index "Month-Day" holds strings, not datetimes, running from the first day to the last day of the year.
I need to use hvplot to develop an interactive plot.
region_cumulative_df_sel.hvplot(width=900)
It is hard to read the labels on the x axis. How can I change the xticks to show only the 1st of each month, e.g. "07-01", "08-01", "09-01", ..., "06-01"?
I tried @Redox's code as below:
region_cumulative_df_sel['Month-Day'] = pd.to_datetime(region_cumulative_df_sel['Month-Day'],format="%m-%d") ##Convert to datetime
from bokeh.models.formatters import DatetimeTickFormatter
## Set format for showing x-axis ... you only need days, but in case counts change
formatter = DatetimeTickFormatter(days=["%m-%d"], months=["%m-%d"], years=["%m-%d"])
region_cumulative_df_sel.plot(x='Month-Day', xformatter=formatter, y=['RAIN_PERCENTILE_25','RAIN_PERCENTILE_50','RAIN_PERCENTILE_75','RAIN_MEAN','RAIN_MEDIAN'], width=900, ylabel="Rainfall (mm)",
rot=90, title="Cumulative Rainfall")
This is what I have generated.
How can I shift the xticks on the x-axis to align with the Month-Day values? Also, the popup window shows "1900" as the year for the Month-Day column. Can the year be removed?
The x-axis data is in string format, so HoloViews treats it as categorical and plots a tick for every row. You need to convert it to datetime, which lets the axis be formatted the way you need. I am using a simple example to show how to do this; it should work in your case as well.
##My month-day column is string - 07-01 07-02 07-03 07-04 ... 12-31
df['Month-Day']=pd.to_datetime(df['Month-Day'],format="%m-%d") ##Convert to datetime
df['myY']=np.random.randint(100, size=(len(df))) ##Random Y data
from bokeh.models.formatters import DatetimeTickFormatter
## Set format for showing x-axis ... you only need days, but in case counts change
formatter = DatetimeTickFormatter(days=["%m-%d"], months=["%m-%d"], years=["%m-%d"])
##Plot graph
df.plot(x='Month-Day',xformatter=formatter)#.opts(xticks=4, xrotation=90)
@Redox is on the right track here. The issue is with the way the Month-Day column is converted to a datetime; pandas assumes the year is 1900 for every row.
Essentially you need to attach a year to the Month-Day in some way.
See the example below. It takes the first month-day string, prepends "2021-" and generates sequential daily values for every row (but there are a few ways of doing this).
code:
import pandas as pd
import numpy as np
import hvplot.pandas
from bokeh.models.formatters import DatetimeTickFormatter
dates = pd.date_range("2021-07-01", "2022-06-30", freq="D")
df = pd.DataFrame({
"md": dates.strftime("%m-%d"),
"ign": np.cumsum(np.random.normal(10, 5, len(dates))),
"sup": np.cumsum(np.random.normal(20, 10, len(dates))),
"imp": np.cumsum(np.random.normal(30, 15, len(dates))),
})
df["time"] = pd.date_range("2021-" + df.md[0], periods=len(df.index), freq="D")
formatter = DatetimeTickFormatter(
days=["%m-%d"], months=["%m-%d"], years=["%m-%d"])
df.hvplot(x='time', xformatter=formatter, y=['ign', 'sup', 'imp'],
width=900, ylabel="Index", rot=90, title="Cumulative ISI")
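The date_range approach above assumes one row per consecutive day. If some days are missing, another option is to assign the year per row from the month. The sketch below is only an illustration: the sample strings, the July cut-over, and the years 2021/2022 are assumptions, not values taken from the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical month-day strings in July-to-June order
md = pd.Series(["07-01", "12-31", "01-01", "02-28", "06-30"])

# Parsing without a year falls back to 1900, which is what the tooltip showed
naive = pd.to_datetime(md, format="%m-%d")
print(naive.dt.year.unique())

# Parse against a leap year so that "02-29" would not fail at this step
parts = pd.to_datetime("2000-" + md, format="%Y-%m-%d")

# Assumed rule: July-December belong to 2021, January-June to 2022.
# A "02-29" row would need a leap year here instead of 2022.
year = np.where(parts.dt.month >= 7, 2021, 2022)
with_year = pd.to_datetime(
    pd.DataFrame({"year": year, "month": parts.dt.month, "day": parts.dt.day})
)
print(with_year)
```

The resulting column increases monotonically across the year boundary, so it can be passed as `x` in the plot in place of the string index.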
I have the following datasets
import pandas as pd
import numpy as np
df = pd.read_excel("https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
sheet_name="Sheet1")
df2 = pd.read_excel("https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
sheet_name="Sheet2")
df2.dropna(inplace = True)
In the first df, the columns come in pairs of X-Axis Value and Y-Axis Value, where the first is a date and the second is a value. For each pair I would like to create rows that share the same date. For instance, at df.iloc[0,0] the timestamp is Timestamp('2020-08-25 23:14:12'). However, the following columns of the same row may contain other dates with different Y-Axis Values associated. The first one in that row is X-Axis Value NCVE-064 HPNDE, with a timestamp of 2020-08-25 23:04:12 and an associated Y-Axis Value of 0.952.
What I want to accomplish is to interpolate those values onto a time interval, maybe 10 minutes, and then merge the results so that every row has the same date.
For df2 it is more or less the same: interpolate the values onto a time interval and add them to the original dataframe. Is there any way to do this?
The trick is to realize that datetimes can be represented as seconds elapsed since some reference time.
Without further context, the hardest part is deciding at what times you want the interpolated values.
import pandas as pd
import numpy as np
from scipy.interpolate import interp1d
df = pd.read_excel(
"https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
sheet_name="Sheet1",
)
x_columns = [col for col in df.columns if col.startswith("X-Axis")]
# What times do we want to align the columns to?
# You can use anything else here or define equally spaced time points
# or something else.
target_times = df[x_columns].min(axis=1)
def interpolate_column(target_times, x_times, y_values):
    ref_time = x_times.min()
    # For interpolation we need to represent the values as floats. One option is to
    # compute the delta in seconds between a reference time and the "current" time.
    deltas = (x_times - ref_time).dt.total_seconds()
    # repeat for our target times
    target_times_seconds = (target_times - ref_time).dt.total_seconds()
    return interp1d(deltas, y_values, bounds_error=False, fill_value="extrapolate")(target_times_seconds)
output_df = pd.DataFrame()
output_df["Times"] = target_times
output_df["Y-Axis Value NCVE-063 VPNDE"] = interpolate_column(
target_times,
df["X-Axis Value NCVE-063 VPNDE"],
df["Y-Axis Value NCVE-063 VPNDE"],
)
# repeat for the other columns, better in a loop
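For the 10-minute interval mentioned in the question, the target times can be an evenly spaced grid instead of the per-row minimum. The sketch below uses made-up sample points and swaps in `np.interp` for `interp1d` to stay dependency-light; note that `np.interp` clamps at the ends rather than extrapolating. The helper name `to_seconds` is hypothetical.

```python
import numpy as np
import pandas as pd


def to_seconds(times, ref):
    """Seconds elapsed between each timestamp and the reference time."""
    return (pd.Series(times) - ref).dt.total_seconds().to_numpy()


# Hypothetical irregular samples for one X-Axis/Y-Axis column pair
x_times = pd.to_datetime(
    pd.Series(["2020-08-25 23:04:12", "2020-08-25 23:24:12", "2020-08-25 23:44:12"])
)
y_values = np.array([0.0, 2.0, 4.0])

ref = x_times.min()
# Evenly spaced 10-minute grid covering the sampled range
grid = pd.date_range(ref, x_times.max(), freq="10min")

y_grid = np.interp(to_seconds(grid, ref), to_seconds(x_times, ref), y_values)
aligned = pd.DataFrame({"Times": grid, "Y": y_grid})
print(aligned)
```

Repeating this for every column pair against the same `grid` gives columns that all share one time axis, which can then be combined into a single dataframe.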
I would like to change the index of my dataframe to datetime so that I can sum the column "Heizung" over a day.
But it doesn't work.
After setting the new index, I would like to use resample to sum over a day.
Here is an extract from my dataframe.
Nr;DatumZeit;Erdtemp;Heizung
0;25.04.21 12:58:42;21.8;1
1;25.04.21 12:58:54;21.8;1
2;25.04.21 12:59:06;21.9;1
3;25.04.21 12:59:18;21.9;1
4;25.04.21 12:59:29;21.9;1
5;25.04.21 12:59:41;22.0;1
6;25.04.21 12:59:53;22.0;1
7;25.04.21 13:00:05;22.1;1
8;25.04.21 13:00:16;22.1;0
9;25.04.21 13:00:28;22.1;0
10;25.04.21 13:00:40;22.1;0
11;25.04.21 13:00:52;22.2;0
12;25.04.21 13:01:03;22.2;0
13;25.04.21 13:01:15;22.2;1
14;25.04.21 13:01:27;22.2;1
15;25.04.21 13:01:39;22.3;1
16;25.04.21 13:01:50;22.3;1
17;25.04.21 13:02:02;22.4;1
18;25.04.21 13:02:14;22.4;1
19;25.04.21 13:02:26;22.4;0
20;25.04.21 13:02:37;22.4;1
21;25.04.21 13:02:49;22.4;0
22;25.04.21 13:03:01;22.4;0
23;25.04.21 13:03:13;22.5;0
24;25.04.21 13:03:25;22.4;0
This is my code
import pandas as pd
Tab = pd.read_csv('/home/kai/Dokumente/TempData', delimiter=';')
Tab1 = Tab[["DatumZeit","Erdtemp","Heizung"]].copy()
Tab1['DatumZeit'] = pd.to_datetime(Tab1['DatumZeit'])
Tab1.plot(x='DatumZeit', figsize=(20, 5),subplots=True)
#Tab1.index.to_datetime()
#Tab1.index = pd.to_datetime(Tab1.index)
Tab1.set_index('DatumZeit')
Tab.info()
Tab1.resample('D').sum()
print(Tab1.head(10))
This is how to convert the column to Timestamp objects, set it as the index, and then resample by day ('D') to sum a column.
Tab1['DatumZeit'] = pd.to_datetime(Tab1.DatumZeit)
Tab1 = Tab1.set_index('DatumZeit') ## missed here
Tab1.resample('D').Heizung.sum()
If you don't want to set the index explicitly, another way to resample is pd.Grouper.
Tab1['DatumZeit'] = pd.to_datetime(Tab1.DatumZeit)
Tab1.groupby(pd.Grouper(key='DatumZeit', freq='D')).Heizung.sum()
If you want the output to be a dataframe, you can use the to_frame method.
Tab1 = Tab1.groupby(pd.Grouper(key='DatumZeit', freq='D')).Heizung.sum().to_frame()
Output
            Heizung
DatumZeit
2021-04-25       15
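Putting the steps together, here is a self-contained sketch. It assumes the timestamps are day.month.year, as the sample suggests, and passes an explicit format so that day and month cannot be swapped. The first three rows come from the sample in the question; the last row on 26.04.21 is invented to show a second day.

```python
import io

import pandas as pd

raw = """Nr;DatumZeit;Erdtemp;Heizung
0;25.04.21 12:58:42;21.8;1
8;25.04.21 13:00:16;22.1;0
13;25.04.21 13:01:15;22.2;1
99;26.04.21 00:00:05;22.0;1
"""

tab = pd.read_csv(io.StringIO(raw), delimiter=";")

# Explicit format: day.month.two-digit-year, then the time
tab["DatumZeit"] = pd.to_datetime(tab["DatumZeit"], format="%d.%m.%y %H:%M:%S")

# set_index returns a new frame, so the result has to be used or assigned
daily = tab.set_index("DatumZeit")["Heizung"].resample("D").sum()
print(daily)
```

The original attempt called `Tab1.set_index('DatumZeit')` without keeping the result, which is why the later `resample` still saw the default integer index.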
Pivot tables to the rescue:
import pandas as pd
import numpy as np
Tab1.pivot_table(index=["DatumZeit"], values=["Heizung"], aggfunc=np.sum)
If you need to do it with setting the index first, you need to use inplace=True on set_index
Tab1.set_index("DatumZeit", inplace=True)
Just note that if you do it this way, you can't go back to a pivot table. In the end, it's whatever works best for you.
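One thing to watch with the pivot-table route: if the index column still holds full timestamps, each distinct timestamp becomes its own row and nothing is summed per day. A possible workaround, sketched below with made-up rows and a hypothetical helper column 'Tag', is to pivot on the timestamp truncated to midnight.

```python
import pandas as pd

# Hypothetical rows in the shape of the question's data
tab1 = pd.DataFrame(
    {
        "DatumZeit": pd.to_datetime(
            ["2021-04-25 12:58:42", "2021-04-25 13:00:16", "2021-04-26 00:00:05"]
        ),
        "Heizung": [1, 0, 1],
    }
)

# Truncate each timestamp to midnight so rows from one day share a key
tab1["Tag"] = tab1["DatumZeit"].dt.normalize()

per_day = tab1.pivot_table(index="Tag", values="Heizung", aggfunc="sum")
print(per_day)
```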
I have a question about a pandas DataFrame that I want to enrich with timings from a tick source (a kdb table).
Pandas DataFrame
Date sym Level
2018-07-01 USDJPY 110
2018-08-01 GBPUSD 1.20
I want to enrich this dataframe with timings (first time for a given currency pair for a given date when the level is crossed).
import numpy as np
from qpython import qconnection
from qpython import MetaData
from qpython.qtype import QKEYED_TABLE
from qpython.qtype import (QSTRING_LIST, QINT_LIST,
                           QDATETIME_LIST, QSYMBOL_LIST, QDATE_LIST)
q.open()
df.meta = MetaData(sym = QSYMBOL_LIST, val = QINT_LIST, Date =
QDATE_LIST)
q('set', np.string_('tbl'), df)
The above code converts pandas dataframe to q table.
Example Code to Access tick data(kdb Tables)
select Mid by sym,date from quotestackevent where date = 2018.07.01, sym = `CCYPAIR
How can I use dataframe columns sym and date to pull data from kdb tables using Qpython?
Suppose on the KDB+ side you have a table t with columns sym (of type symbol), date (of type date), and mid (of type float), for example generated by the following code:
t:`date xasc ([] sym:raze (3#) each `USDJPY`GBPUSD`EURBTC;date:9#.z.d-til 3;mid:9?`float$10)
Then to bring the data for enrichment from the KDB+ side to the Python side you can do the following:
from qpython import qconnection
import pandas as pd
df = pd.DataFrame({'Date': ['2018-09-08','2018-09-08','2018-09-07','2018-09-07'],'sym':['abc','def','abc','def']})
df['Date']=df['Date'].astype('datetime64[ns]')
with qconnection.QConnection(host = 'localhost', port = 5001, pandas = True) as q:
    X = q.sync('{select sym,date,mid from t where date in `date$x}',df['Date'])
Here the first argument to q.sync() defines a function to be executed and the second argument is the range of dates you want to get from the table t. Inside the function the `date$x part converts the argument to a list of dates, which is needed because df['Date'] is sent as a list of timestamps to the KDB+ side.
The resulting X data frame will have the sym column as binary strings, so you may want to do something like
X['sym'].apply(lambda x: x.decode('ascii'))
to convert that to strings.
An alternative to sending the function definition is to have a function defined on the KDB+ side and send only its name from the Python side. So, if you can do something like
getMids:{select sym,date,mid from t where date in `date$x}
on the KDB+ side, then you can do
X = q.sync('getMids',df['Date'])
instead of sending the function definition.
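Once X is back on the Python side, the enrichment itself can be an ordinary pandas merge. The sketch below does not connect to KDB+: the frame `X` is a simulated stand-in for what `q.sync` returns, with column names and byte-string symbols assumed from the description above, and the `mid` values are made up.

```python
import pandas as pd

df = pd.DataFrame(
    {"Date": pd.to_datetime(["2018-09-08", "2018-09-07"]), "sym": ["abc", "def"]}
)

# Simulated stand-in for the result of q.sync(...); values are made up
X = pd.DataFrame(
    {
        "sym": [b"abc", b"def", b"abc"],
        "date": pd.to_datetime(["2018-09-08", "2018-09-07", "2018-09-07"]),
        "mid": [1.5, 2.5, 3.5],
    }
)

# Decode the byte strings and assign the result back to the column
X["sym"] = X["sym"].apply(lambda s: s.decode("ascii"))

# Align the key names, then attach the matching mid to each (Date, sym) pair
X = X.rename(columns={"date": "Date"})
enriched = df.merge(X, on=["Date", "sym"], how="left")
print(enriched)
```

A left merge keeps every row of the original dataframe, leaving `mid` empty where the tick table had no match.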