I would like to compare values from different weeks across different groups: something like daily sales for two team members, week by week, to demonstrate the effect of one person being off, a holiday, etc. The time of each sale needs to be ordered within its day, but the x axis should be labelled by day.
The example is arbitrary.
Example data and output:
options(stringsAsFactors = FALSE)
library(lubridate)
library(tidyverse)
library(magrittr)
#=======================
# Week on week comparison of days by a group
#=======================
# Generate DF
Date <- data.frame(Date = rep(seq(as.Date("2020-04-01"), as.Date("2020-04-14"), by = "days"), 4))
Time <- data.frame(Time = c(rep("00:00:01", nrow(Date) / 2), rep("00:00:02", nrow(Date) / 2)))
Type <- data.frame(Type = rep(c(rep("a", nrow(Date) / 4), rep("b", nrow(Date) / 4)), 2))
df <- cbind(Date, Time, Type)
# Add random values to plot
df %<>% mutate(values = runif(nrow(.),1,10))
# Create groups for weeks, an ordering for days, and labels as weekdays (character strings).
df %<>% mutate(weekLevel = week(Date),
               dayLevel = wday(Date),
               Day = as.character(weekdays(Date)),
               orderVar = paste0(dayLevel, Time))
ggplot(df %>% arrange(orderVar),
       aes(x = orderVar, y = values, group = interaction(Type, weekLevel), colour = Type)) +
  geom_line() +
  scale_x_discrete(breaks = df$orderVar, labels = df$Day) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
This works, but each day is repeated because the breaks are set at a more granular level than the labels. It also feels a bit hacky.
Any and all feedback is appreciated :)
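One way to get a single label per day is to thin the breaks to one per weekday instead of passing every orderVar. A minimal sketch, reusing the df built above (taking the earliest orderVar of each day as the break position is an arbitrary choice):
# One break per weekday: keep the first orderVar for each dayLevel
brks <- df %>%
  arrange(orderVar) %>%
  distinct(dayLevel, .keep_all = TRUE)
ggplot(df %>% arrange(orderVar),
       aes(x = orderVar, y = values, group = interaction(Type, weekLevel), colour = Type)) +
  geom_line() +
  scale_x_discrete(breaks = brks$orderVar, labels = brks$Day) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))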
Related
I have the following dataset. The goal is to calculate the duration per "Level" group.
Dataset:
import pandas as pd
from datetime import datetime, date
data = {'Time': ["08:35:00", "08:40:00", "08:45:00", "08:55:00", "08:57:00", "08:59:00"],
        'Level': [250, 250, 250, 200, 200, 200]}
df = pd.DataFrame(data)
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S').dt.time
I am able to calculate the difference between two times with this code:
t1 = df['Time'].iloc[0]
t2 = df['Time'].iloc[1]
c = datetime.combine(date.today(), t2) - datetime.combine(date.today(), t1)
But I am not able to "automate" the calculation. The code below fails because datetime.time objects don't support subtraction; it only works for numeric columns:
df2 = df.groupby('Level').apply(lambda x: x.Time.max() - x.Time.min())
If you keep the date part of Time, the calculation is a lot easier:
df = pd.DataFrame(data)
# Keep the date part, even though it's meaningless
df["Time"] = pd.to_datetime(df["Time"], format="%H:%M:%S")
def to_string(duration: pd.Timedelta) -> str:
    total = duration.total_seconds()
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02.0f}:{minutes:02.0f}:{seconds:02.0f}"
level = df["Level"]
# CAUTION: avoid calling to_string until the very last step,
# when you need to display your result. There aren't many
# calculations you can do with strings.
df["Time"].groupby(level).diff().groupby(level).sum().apply(to_string)
I'm working with a large df, trying to make some plots by filtering data through different attributes of interest. Let's say my df looks like:
df <- data.frame(site = c("A", "B", "C", "D", "E"),
                 subsite = c("w", "x", "y", "z"),
                 date = c("01/01/1985", "05/01/1985", "16/03/1995", "24/03/1995"),
                 species = c(1, 2, 3, 4),
                 Year = c(1985, 1990, 1995, 2012),
                 `julian day` = c(1, 2, 3, 4),
                 Month = c(6, 7, 8, 11))
I would like to plot the average Julian day per month for each year in which a species was present in a subsite and site. So far I've got the code below, but the average is calculated for each month over all the years in my df rather than per year. Any help/directions would be welcome!
Plot1 <- df %>%
  filter(Site == "A", Year > 1985, Species == "2") %>%
  group_by(Month) %>%
  mutate(Day = mean(`julian day`)) %>%
  ggplot(aes(x = Year, y = Day, color = Species)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point",
               shape = 1, size = 1, show.legend = FALSE) +
  stat_summary(fun = mean, colour = "red", geom = "text", show.legend = FALSE,
               vjust = -0.7, size = 3, aes(label = round(..y.., digits = 0)))
Thanks!
I think I spotted the error.
I was missing this:
group_by(Month, Year) %>%
import matplotlib.pyplot as plt

# plot data
fig, ax = plt.subplots(figsize=(25, 17))
plt.ylabel('No of tweets', fontsize=12)
# plt.xlim([1, 20])
plt.title('Number of tweets', fontsize=20)
data.sort_values(by=['Year', 'Month'], ascending=[True, True]) \
    .groupby(['Month', 'Year']).count()['text'].plot(ax=ax)
plt.xlabel('Month-Year', fontsize=12)
I have attached the current output here. Can you help me understand what I'm doing wrong?
Combine the Year and Month columns into a new column, Date:
data['Date'] = data['Year'].astype('str') + '-' \
+ data['Month'].astype('str').str.zfill(2)
# groupby sorts its groups by default
data.groupby('Date')['text'].count().plot(ax=ax)
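If you would rather have a true chronological axis than string labels, you could parse the combined column into real datetimes; a small sketch under the same assumptions (integer Year and Month columns, pandas imported as pd):
# Alternative: a real datetime column, so the x-axis is chronological
# and matplotlib formats the tick labels itself
data['Date'] = pd.to_datetime(
    data['Year'].astype('str') + '-' + data['Month'].astype('str').str.zfill(2),
    format='%Y-%m')
data.groupby('Date')['text'].count().plot(ax=ax)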
I have hourly stock data.
I need (a) to format it so that matplotlib ignores weekends and non-business hours, and (b) an hourly frequency.
The problem:
Currently the graph looks cramped, and I suspect it is because matplotlib is taking into account 24 hours a day instead of 8, and 7 days a week instead of business days only.
How do I tell pandas to take into account only business hours, Monday to Friday?
How I am graphing the data:
I am looping through a list of price-data dataframes and graphing each one:
import matplotlib.pyplot as plt

mm = 0
for ii in df:
    Ddate = ii['Date']
    Pprice = ii['Price']
    d = Ddate.to_list()
    p = Pprice.to_list()
    dates = make_dt(d)
    prices = unstring(p)
    plt.figure()
    plt.plot(dates, prices)
    plt.title(stocks[mm])
    plt.grid(True)
    plt.xlabel('Dates')
    plt.ylabel('Prices')
    mm += 1
the graph:
To flag business days, you can use pd.bdate_range as below:
df["IsBDay"] = df['date'].apply(lambda x: bool(len(pd.bdate_range(x, x))))
# The line above adds a new column IsBDay to the df: for a single
# date x, pd.bdate_range(x, x) is empty exactly when x is not a
# business day.
# You can also use a lambda that yields 'True'/'False' strings:
df['IsBDay'] = df['date'].apply(lambda x: 'True' if bool(len(pd.bdate_range(x, x))) else 'False')
Now create a new df that keeps only the rows where IsBDay is not 'False':
df[df.IsBDay != 'False']
Now your df is ready for plotting.
Hope this helps.
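This handles the weekend half of the question. For the non-business-hours half, a similar mask on the timestamp's hour could work; a sketch, assuming the 'Date' column holds full timestamps and using 9:00-16:00 purely as a stand-in for your market's trading hours:
# Keep only rows within business hours on weekdays
ts = pd.to_datetime(df['Date'])
df = df[ts.dt.hour.between(9, 16) & (ts.dt.dayofweek < 5)]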
So I have a table (~2000 rows; call it df1) of when a particular subject received a medication on a particular date, and a large excel file (>1 million rows) of weight data for subjects on different dates (call it df2).
AIM: I want to group by subject and find the weight in df2 that was recorded closest to the medication admin time in df1, using sqldf (because the tables are too big to load into R). Alternatively, I can set up a time frame of interest (e.g. +/- 1 week of medication given) and find a row that falls within that timeframe.
Example:
df1 <- data.frame(
  PtID = rep(c(1:5), each = 2),
  Dose = rep(seq(100, 200, 25), 2),
  ADMIN_TIME = seq.Date(as.Date("2016/01/01"), by = "month", length.out = 10)
)
df2 <- data.frame(
  PtID = rep(c(1:5), each = 10),
  Weight = rnorm(50, 50, 10),
  Wt_time = seq.Date(as.Date("2016/01/01"), as.Date("2016/10/31"), length.out = 50)
)
So I think I want to left_join df1 and df2, group by PtID, and set up some condition that identifies either the df2$Weight closest to df1$ADMIN_TIME, or a df2$Weight within an acceptable range around df1$ADMIN_TIME, using SQL formatting.
So I tried creating a range and then querying the following:
library(dplyr)
library(lubridate)
library(sqldf)
df1 <- df1 %>%
  mutate(ADMIN_START = ADMIN_TIME - ddays(30),
         ADMIN_END   = ADMIN_TIME + ddays(30))
# df2.csv is the large spreadsheet saved in my working directory
result <- read.csv.sql("df2.csv", sql = "select Weight from file
                        left join df1
                        on file.Wt_time between df1.ADMIN_START and df1.ADMIN_END")
This runs, but it never returns anything and I have to escape out of it. Any thoughts are appreciated.
Thanks!
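One thing that stands out: the join has no PtID condition, so SQLite must compare every row of the file with every row of df1 (roughly 2000 x 1 million pairs), which would explain the apparent hang. Also, read.csv.sql reads the csv columns as text, while df1's Date columns are likely uploaded as day numbers, so the between comparison may never match. A sketch that addresses both, assuming df2.csv stores Wt_time as ISO "YYYY-MM-DD" text:
library(sqldf)
# Make df1's dates comparable with the text dates in the csv
# (lexicographic order equals chronological order for ISO dates)
df1$ADMIN_START <- as.character(df1$ADMIN_START)
df1$ADMIN_END   <- as.character(df1$ADMIN_END)
result <- read.csv.sql("df2.csv", sql = "
  select df1.PtID, df1.Dose, file.Weight, file.Wt_time
  from df1
  left join file
    on file.PtID = df1.PtID
   and file.Wt_time between df1.ADMIN_START and df1.ADMIN_END")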