I have hourly stock data.
I need (a) to format it so that matplotlib ignores weekends and non-business hours, and (b) an hourly frequency.
The problem:
Currently, the graph looks crammed, and I suspect it is because matplotlib is taking into account 24 hours a day instead of 8, and 7 days a week instead of business days only.
How do I tell pandas to only take into account business hours, M-F?
How I am graphing the data:
I am looping through a list of price data dataframes, graphing each data frame:
mm = 0
for ii in df:
    Ddate = ii['Date']
    Pprice = ii['Price']
    d = Ddate.to_list()
    p = Pprice.to_list()
    dates = make_dt(d)
    prices = unstring(p)
    plt.figure()
    plt.plot(dates, prices)
    plt.title(stocks[mm])
    plt.grid(True)
    plt.xlabel('Dates')
    plt.ylabel('Prices')
    mm += 1
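(make_dt and unstring are helpers not shown in the question; presumably they parse the date strings into datetime objects and the price strings into floats, something like this hypothetical sketch:)

from datetime import datetime

def make_dt(date_strings):
    # Hypothetical: parse date strings into datetime objects
    # (the actual format string is a guess)
    return [datetime.strptime(s, '%Y-%m-%d %H:%M:%S') for s in date_strings]

def unstring(price_strings):
    # Hypothetical: convert price strings to floats
    return [float(s) for s in price_strings]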
The graph: [screenshot omitted]
To fetch business days, you can use pd.bdate_range as below (note the column is named 'Date' to match your dataframes; pd.bdate_range(x, x) is empty exactly when x falls on a weekend):

df['IsBDay'] = df['Date'].apply(lambda x: bool(len(pd.bdate_range(x, x))))

# The line above adds a new boolean column IsBDay to the DF.
Now create a new DF that keeps only the rows where IsBDay is True:

df = df[df['IsBDay']]

Now your DF is ready for plotting.
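To also drop non-business hours, one option is to filter on the hour of day. A minimal sketch, assuming the 'Date' column holds full datetimes (or parseable strings) and assuming 09:00-17:00 trading hours (adjust to your market):

# Keep only rows between 09:00 and 16:59 (assumed trading hours)
hours = pd.to_datetime(df['Date']).dt.hour
df = df[hours.between(9, 16)]

Note that matplotlib will still space the remaining points by their actual timestamps, so visual gaps may remain where rows were dropped.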
Hope this helps.
I have the following dataset:
I would like to get a result as follows (expected output shown as an image, omitted here):
The goal is to calculate the duration per value of the "Level" column.
Dataset:
import pandas as pd
from datetime import datetime, date
data = {'Time': ["08:35:00", "08:40:00", "08:45:00", "08:55:00", "08:57:00", "08:59:00"],
        'Level': [250, 250, 250, 200, 200, 200]}
df = pd.DataFrame(data)
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S').dt.time
I am able to calculate the difference between two datetimes with this code:
t1 = df['Time'].iloc[0]
t2 = df['Time'].iloc[1]
c = datetime.combine(date.today(), t2) - datetime.combine(date.today(), t1)
But I am not able to "automate" the calculation; the following only works for integers:
df2 = df.groupby('Level').apply(lambda x: x.Time.max() - x.Time.min())
If you keep the date part of Time, the calculation is a lot easier:
df = pd.DataFrame(data)
# Keep the date part, even though it's meaningless
df["Time"] = pd.to_datetime(df["Time"], format="%H:%M:%S")
def to_string(duration: pd.Timedelta) -> str:
    total = duration.total_seconds()
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02.0f}:{minutes:02.0f}:{seconds:02.0f}"
level = df["Level"]
# CAUTION: avoid calling to_string until the very last step,
# when you need to display your result. There are not many
# calculations you can do with strings.
df["Time"].groupby(level).diff().groupby(level).sum().apply(to_string)
I am attempting to transfer my team's EViews code to Python and I got stuck with the following line in EViews:
equation eq_LSTrend.ls(cov=hac) log({Price})=c(1) * #trend + c(2).
Here, a time-trend regression over a certain time window is to be performed on log(price), and the slope c(1) as well as the intercept c(2) have to be determined.
Let's say I have the following df:
import pandas as pd
Range = pd.date_range('1990-01-01', periods=8, freq='D')
log_price = [5.0835, 5.0906, 5.0946, 5.0916, 5.0825, 5.0833, 5.0782, 5.0709]
df = pd.DataFrame({ 'Date': Range, 'Log Price': log_price })
df.set_index('Date', inplace=True)
And the df looks like this:
Date Log Price
1990-01-01 5.0835
1990-01-02 5.0906
1990-01-03 5.0946
1990-01-04 5.0916
1990-01-05 5.0825
1990-01-06 5.0833
1990-01-07 5.0782
1990-01-08 5.0709
How could I, for example, take a rolling 5-period window, do an OLS or GLS analysis, and get the desired parameters (the slope and the intercept)?
Also, which library would be appropriate for it (statsmodels or maybe some other library)?
Ideally, the code would look something like this:
df_window = df.rolling(window = 5)
slope_output = sm.GLS(df_window).slope
or, if separate columns have to be provided as input (in this case I would leave "Date" as a separate column in df):
df_window = df.rolling(window = 5)
slope_output = sm.GLS(depend_var = df_window["Log Price"], independ_var = df_window["Date"]).slope
I am quite new to Python, so please pardon my bad coding!
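One possible starting point is RollingOLS from statsmodels. A minimal sketch, assuming a simple 0, 1, 2, ... counter as the trend regressor in place of EViews' #trend and a window of 5:

import numpy as np
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS

# Numeric trend (0, 1, 2, ...) standing in for EViews' #trend
trend = np.arange(len(df))
X = sm.add_constant(trend)  # prepends the intercept column

res = RollingOLS(df['Log Price'], X, window=5).fit()
# res.params has one row per window end: 'const' is the intercept,
# 'x1' is the slope on the trend; the first 4 rows are NaN
print(res.params)

The EViews cov=hac option only affects the standard errors, not the slope and intercept estimates themselves.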
I am looking to perform a fast operation on flightradar data to see whether the speed implied by the distance travelled matches the speed reported. I have multiple flights and was told not to run double loops on pandas dataframes. Here is a sample dataframe:
import pandas as pd
from datetime import datetime
from shapely.geometry import Point
from geopy.distance import distance
dates = ['2020-12-26 15:13:01', '2020-12-26 15:13:07', '2020-12-26 15:13:19',
         '2020-12-26 15:13:32', '2020-12-26 15:13:38']
datetimes = [datetime.fromisoformat(date) for date in dates]
data = {'UTC': datetimes,
        'Callsign': ["1", "1", "2", "2", "2"],
        'Position': [Point(30.542175, -91.13999200000001),
                     Point(30.546204, -91.14020499999999),
                     Point(30.551443, -91.14417299999999),
                     Point(30.553909, -91.15136699999999),
                     Point(30.554489, -91.155075)]}
df = pd.DataFrame(data)
What I want to do is add a new column called "dist". This column will be 0 for the first row of each callsign; otherwise it will be the distance between the row's point and the previous point.
The resulting df should look like this:
df1 = df
dist = [0,0.27783309075379214,0,0.46131362750613436,0.22464461718704595]
df1['dist'] = dist
What I have tried is to first assign a group index:
df['group_index'] = df.groupby('Callsign').cumcount()
Then group by callsign and try to apply the function:
df['dist'] = df.groupby('Callsign').apply(
    lambda g: 0 if g.group_index == 0
    else distance((g.Position.x, g.Position.y),
                  (g.Position.shift().x, g.Position.shift().y)).miles)
I was hoping this would give me the 0 for the first index of each group and then run the distance function on all others and return a value in miles. However it does not work.
The code errors out for at least one reason: the .x and .y attributes of the Shapely Point are being accessed on the whole Series rather than on an individual Point object.
Any ideas on how to fix this would be much appreciated.
1. Sort df by callsign, then timestamp.
2. Compute distances between adjacent rows using a temporary column of shifted points.
3. For the first row of each new callsign, set the distance to 0.
4. Drop the temporary column.
df = df.sort_values(by=['Callsign', 'UTC'])
df['Position_prev'] = df['Position'].shift().bfill()
def get_dist(row):
    return distance((row['Position'].x, row['Position'].y),
                    (row['Position_prev'].x, row['Position_prev'].y)).miles

df['dist'] = df.apply(get_dist, axis=1)
# Flag row if callsign is different from previous row callsign
new_callsign_rows = df['Callsign'] != df['Callsign'].shift()
# Zero out the first distance of each callsign group
df.loc[new_callsign_rows, 'dist'] = 0.0
# Drop shifted column
df = df.drop(columns='Position_prev')
print(df)
UTC Callsign Position dist
0 2020-12-26 15:13:01 1 POINT (30.542175 -91.13999200000001) 0.000000
1 2020-12-26 15:13:07 1 POINT (30.546204 -91.14020499999999) 0.277833
2 2020-12-26 15:13:19 2 POINT (30.551443 -91.14417299999999) 0.000000
3 2020-12-26 15:13:32 2 POINT (30.553909 -91.15136699999999) 0.461314
4 2020-12-26 15:13:38 2 POINT (30.554489 -91.155075) 0.224645
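As an aside, a grouped shift gives a more compact variant (a sketch using the same geopy distance call): shifting within each Callsign group leaves NaN at each group's first row, so no separate flag is needed:

# Previous position within each callsign group (NaN for each group's first row)
prev = df.groupby('Callsign')['Position'].shift()
df['dist'] = [0.0 if pd.isna(p_prev)
              else distance((p.x, p.y), (p_prev.x, p_prev.y)).miles
              for p, p_prev in zip(df['Position'], prev)]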
I would like to compare the values from different weeks in different groups, something like daily sales for two team members by week, to demonstrate the effect of one person being off, a holiday, etc. The time of each sale needs to be ordered within the day, but the x-axis should be labelled by day.
Example is arbitrary.
Example data and output
options(stringsAsFactors = FALSE)
library(lubridate)
library(tidyverse)
library(magrittr)
#=======================
# Week on week comparison of days by a group
#=======================
# Generate DF
Date <- data.frame(Date = rep(seq(as.Date("2020-04-01"), as.Date("2020-04-14"), by = "days"), 4))
Time <- data.frame(Time = c(rep("00:00:01", nrow(Date)/2), rep("00:00:02", nrow(Date)/2)))
Type <- data.frame(Type = rep(c(rep("a", nrow(Date)/4), rep("b", nrow(Date)/4)), 2))
df <- cbind(Date,Time,Type)
# Add random values to plot
df %<>% mutate(values = runif(nrow(.),1,10))
# Create a groups for weeks, orders for days and labels as weekdays (char strings).
df %<>% mutate(weekLevel = week(Date),
               dayLevel = wday(Date),
               Day = as.character(weekdays(Date)),
               orderVar = paste0(dayLevel, Time))
ggplot(df %>% arrange(orderVar),
       aes(x = orderVar, y = values, group = interaction(Type, weekLevel), colour = Type)) +
  geom_line() +
  scale_x_discrete(breaks = df$orderVar, labels = df$Day) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
This works, but each day label is repeated because the breaks are set at a more granular level than the labels. It also feels a bit hacky.
Any and all feedback is appreciated :)
So I have a table (~2000 rows, call it df1) of when a particular subject received a medication on a particular date, and I have a large Excel file (>1 million rows) of weight data for subjects on different dates (call it df2).
AIM: I want to group by subject and find the weight in df2 that was recorded closest to the medication admin time in df1, using sqldf (because the tables are too big to load into R). Alternatively, I can set up a time frame of interest (e.g. +/- 1 week of the medication being given) and find a row that falls within that timeframe.
Example:
df1 <- data.frame(
  PtID = rep(c(1:5), each = 2),
  Dose = rep(seq(100, 200, 25), 2),
  ADMIN_TIME = seq.Date(as.Date("2016/01/01"), by = "month", length.out = 10)
)

df2 <- data.frame(
  PtID = rep(c(1:5), each = 10),
  Weight = rnorm(50, 50, 10),
  Wt_time = seq.Date(as.Date("2016/01/01"), as.Date("2016/10/31"), length.out = 50)
)
So I think I want to left_join df1 and df2, group by PtID, and set up some condition that identifies either the df2$Weight closest to df1$ADMIN_TIME or a df2$Weight within an acceptable range around df1$ADMIN_TIME, using SQL formatting.
So I tried creating a range and then querying the following:
library(dplyr)
library(lubridate)
df1 <- df1 %>%
  mutate(ADMIN_START = ADMIN_TIME - ddays(30),
         ADMIN_END = ADMIN_TIME + ddays(30))
#df2.csv is the large spreadsheet saved in my working directory
result <- read.csv.sql("df2.csv", sql = "select Weight from file
left join df1
on file.Wt_time between df1.ADMIN_START and df1.ADMIN_END")
This will run but it never returns anything and I have to escape out of it. Any thoughts are appreciated.
Thanks!