Data handling on multiple Heart rate files - heartbeat

I have been collecting the heart rates of 12 calves, each of which received an anesthetic through four different routes of administration. I now have 48 .txt files in this format:
Time HRbpm
0:00:01.7 97
0:00:02.3 121
0:00:02.8 15
... ...
HR was recorded for around 2 hours. The Time column depended on the monitor, resulting in inconsistent time intervals between measurements.
The .txt files are named as follows: 6133_IM_27.00.txt
with 6133 being the animal ID, IM the route of administration, and 27.00 the time (mm.ss) at which the treatment was injected.
My first goal is to gather all the HR data so I can run an outlier analysis.
Then, I would like to combine all of this data into a single data frame that would look like this:
data.frame(ID    = c(6133, 6133, 6133, 6133, "...", 6134, 6134, "..."),
           Route = c("IM", "IM", "IM", "IM", "...", "SC", "SC", "..."),
           time  = c(0, 10, 20, 30, "...", 0, 10, "..."),
           HR    = c(160, 150, 145, 130, "...", 162, 158, "..."))
The time column would go from 0 to 120 in 10-minute increments.
Each HR value in this data frame would be the mean of the HR values over the minute preceding the given time (e.g. for time = 30, HR would be the mean between minutes 29 and 30 for a given ID/Route combination).
I'm fairly new to R, so I've been having trouble even knowing from what angle to approach this problem. Any help would be welcome.
Thanks,
Thomas

For anyone who stumbles on this post, here's what I've done; it seems to be working.
library(plyr)
library(reshape)
library(ggplot2)

setwd("/directory")

# read every .txt file and tag each one with its file name (stored as column V3)
filelist <- list.files(pattern = "\\.txt$")
datalist <- lapply(filelist, read.delim)
for (i in 1:length(datalist)) {
  datalist[[i]][3] <- filelist[i]
}
df <- do.call("rbind", datalist)

# outlier thresholds (Tukey fences); these came out to roughly 60 and 200 bpm
out_lowHR  <- quantile(df$HRbpm, 0.25) - 1.5 * IQR(df$HRbpm)
out_highHR <- quantile(df$HRbpm, 0.75) + 1.5 * IQR(df$HRbpm)
dfc <- subset(df, HRbpm >= 60 & HRbpm <= 200)
(length(df$HRbpm) - length(dfc$HRbpm)) / length(df$HRbpm) * 100  # ~8.6% of values excluded
df <- dfc

# recover ID, route and injection time from the file name
# (the character positions must match your exact file-name layout)
df$ID  <- substr(df$V3, 4, 7)
df$ROA <- substr(df$V3, 9, 11)
df$ti  <- substr(df$V3, 13, 17)

# time of each measurement relative to the injection, in seconds
df$Time <- as.POSIXct(as.character(df$Time), format = "%H:%M:%S")
df$ti   <- as.POSIXct(as.character(df$ti), format = "%M.%S")
df$t    <- as.numeric(difftime(df$Time, df$ti, units = "secs"))

# mean HR over the minute preceding each time point of interest
m <- 60
meanHR <- ddply(df, c("ROA", "ID"), summarise,
                mean0   = mean(HRbpm[t > -60 * m  & t <= 0]),
                mean10  = mean(HRbpm[t > 9 * m    & t <= 10 * m]),
                mean20  = mean(HRbpm[t > 19 * m   & t <= 20 * m]),
                mean30  = mean(HRbpm[t > 29 * m   & t <= 30 * m]),
                mean45  = mean(HRbpm[t > 44 * m   & t <= 45 * m]),
                mean60  = mean(HRbpm[t > 59 * m   & t <= 60 * m]),
                mean90  = mean(HRbpm[t > 89 * m   & t <= 90 * m]),
                mean120 = mean(HRbpm[t > 119 * m  & t <= 120 * m]))

# long format for plotting
meanHR <- melt(meanHR)
meanHR$time <- as.numeric(gsub("mean", "", meanHR$variable))

ggplot(meanHR, aes(x = time, y = value, col = ROA)) +
  geom_smooth() +
  theme_classic()
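A more compact way to get the same per-minute means is sketched below. This is only a sketch: it assumes the same df built above (with ID, ROA, HRbpm and t in seconds relative to injection) and handles the baseline differently than mean0.
library(dplyr)  # if both are used, load plyr first and dplyr second, as here
marks <- c(10, 20, 30, 45, 60, 90, 120)           # minutes at which means are wanted
meanHR2 <- df %>%
  mutate(t_min = t / 60) %>%
  filter(ceiling(t_min) %in% marks) %>%           # keep only the minute ending at each mark
  group_by(ID, ROA, time = ceiling(t_min)) %>%
  summarise(HR = mean(HRbpm), .groups = "drop")
# the baseline (time 0) averages the whole hour before injection in the code above,
# so it would need to be computed separately and added with bind_rows()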


CSV file search speedup

I need to build a relief (elevation) profile from coordinates. I have a CSV file with 12,000,000 lines, and looking up a single height in it takes about 2 - 2.5 seconds. I converted the CSV to Parquet, which saved some time: finding one height now takes about 1 - 1.7 seconds. However, I need to build a profile from 500 - 2000 values, which makes the total time very long. The CSV may also grow in the future, which would slow the process down even more. Is it possible to somehow reduce the processing time?
Code example:
import dask.dataframe as dk
import matplotlib.pyplot as plt
import numpy as np
import time

filename = 'n46_e032_1arc_v3.csv'
df = dk.read_csv(filename)
df.to_parquet('n46_e032_1arc_v3_parquet')

# end points of the profile and step size (1 arc-second)
Latitude1y, Longitude1x = 46.6276, 32.5942
Latitude2y, Longitude2x = 46.6451, 32.6781
sec, steps, k = 0.00027778, 1, 11.73

# build the list of points along the profile line
Latitude, Longitude = [Latitude1y], [Longitude1x]
sin, cos = Latitude2y - Latitude1y, Longitude2x - Longitude1x
y, x = Latitude1y, Longitude1x
while Latitude[-1] < Latitude2y and Longitude[-1] < Longitude2x:
    y, x, steps = y + sec * k * sin, x + sec * k * cos, steps + 1
    Latitude.append(y)
    Longitude.append(x)

time_start = time.time()
long, elevation_data = [], []
df2 = dk.read_parquet('n46_e032_1arc_v3_parquet')
for i in range(steps):
    # select the grid cell containing the i-th profile point
    elevation_line = df2[(Longitude[i] <= df2['x']) & (df2['x'] <= Longitude[i] + sec) &
                         (Latitude[i] <= df2['y']) & (df2['y'] <= Latitude[i] + sec)].compute()
    elevation = np.asarray(elevation_line.z.tolist())
    if elevation[-1] < 0:
        elevation_data.append(0)
    else:
        elevation_data.append(elevation[-1])
    long.append(30 * i)
plt.bar(long, elevation_data, width=30)
plt.show()
print(time.time() - time_start)
Here's one way to solve this problem using KD trees. A KD tree is a data structure for doing fast nearest-neighbor searches.
import scipy.spatial

# assumes df is an in-memory pandas DataFrame with columns x, y and z
tree = scipy.spatial.KDTree(df[['x', 'y']].values)
elevations = df['z'].values

long, elevation_data = [], []
for i in range(steps):
    lon, lat = Longitude[i], Latitude[i]
    dist, idx = tree.query([lon, lat])   # nearest grid point to this profile point
    elevation = elevations[idx]
    if elevation < 0:
        elevation = 0
    elevation_data.append(elevation)
    long.append(30 * i)
Note: if you can make assumptions about the data, like "all of the points in the CSV are equally spaced," faster algorithms are possible.
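As a follow-up sketch using the same tree, elevations, Longitude, Latitude and steps as above: KDTree.query() also accepts an array of points, so the whole profile can be looked up in one vectorized call.
import numpy as np

pts = np.column_stack([Longitude[:steps], Latitude[:steps]])  # one query point per profile step
dist, idx = tree.query(pts)                                   # nearest grid point for each
elevation_data = np.maximum(elevations[idx], 0)               # clamp negative elevations to 0
long = 30 * np.arange(steps)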
It looks like your data might be on a regular grid. If (and only if) every combination of x and y exists in your data, then it probably makes sense to turn this into a labeled 2D array of points, after which querying the correct position will be very fast.
For this, I'll use xarray, which is essentially pandas for N-dimensional data, and integrates well with dask:
import xarray as xr

# bring the dataframe into memory
df = dk.read_parquet('n46_e032_1arc_v3_parquet').compute()
da = df.set_index(["y", "x"]).z.to_xarray()

# now you can query the nearest points:
desired_lats = xr.DataArray([46.6276, 46.6451], dims=["point"])
desired_lons = xr.DataArray([32.5942, 32.6781], dims=["point"])
subset = da.sel(y=desired_lats, x=desired_lons, method="nearest")

# if you'd like, you can return to pandas:
subset_s = subset.to_series()

# you could do this only once, and save the reshaped array as a zarr store:
ds = da.to_dataset(name="elevation")
ds.to_zarr("n46_e032_1arc_v3.zarr")
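A small sketch of reusing that store in a later session (assuming the zarr store written above and the zarr library installed):
import xarray as xr

ds = xr.open_zarr("n46_e032_1arc_v3.zarr")
profile = ds.elevation.sel(y=desired_lats, x=desired_lons, method="nearest")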

Efficient way to join 65,000 .csv files

I have, say, 65,000 .csv files that I need to work with in Julia.
The goal is to perform basic statistics on the data set.
I have considered a couple of ways of joining all the data sets:
#1 - set a common index and leftjoin() - perform statistics row-wise
#2 - vcat() the dataframes on top of each other - vertically stacked, use groupby
Either way, the final data frames are very large and become slow to process.
Is there an efficient way of doing this?
I thought of performing either #1 or #2 and splitting the joining operations into thirds: say, after 20,000 joins, save to .csv and operate in chunks, then at the end join all three in one last operation.
I'm not sure how to replicate making 65k .csv files here, but basically, below I loop through the files in the directory, load each CSV, and vcat() it onto one DataFrame. The question is really whether there is a better way to manage the size of the operation, since vcat() keeps growing the result. Perhaps I could cycle through the .csv files ahead of time to obtain each file's dimensions, initialize the full DataFrame to the final output size, and then cycle through each .csv and populate the initialized DataFrame.
using CSV
using DataFrames

# read all file names in the directory
csv_dir_tmax = cd(readdir, "C:/Users/andrew.bannerman/Desktop/Julia/scripts/GHCN data/ghcnd_all_csv/tmax")

# initialize output
tmax_all = DataFrame(Date = [], TMAX = [])

for c = 1:length(csv_dir_tmax)
    print("Starting csv file ", csv_dir_tmax[c], " - Iteration ", c, "\n")
    csv_tmax = CSV.read(join(["C:/Users/andrew.bannerman/Desktop/Julia/scripts/GHCN data/ghcnd_all_csv/tmax/", csv_dir_tmax[c]]), DataFrame, header = true)
    global tmax_all = vcat(tmax_all, csv_tmax)  # grows the result on every iteration
end
The following approach should be relatively efficient (assuming that data fits into memory):
tmax_all = reduce(vcat, [CSV.read("YOUR_DIR$x", DataFrame) for x in csv_dir_tmax])
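As a side note, here is the same idea spelled out with joinpath and the directory from the question (a sketch, assuming all files live in that one directory). DataFrames.jl's specialised reduce(vcat, ...) method builds the result in a single pass instead of growing it file by file:
dir = "C:/Users/andrew.bannerman/Desktop/Julia/scripts/GHCN data/ghcnd_all_csv/tmax"
tmax_all = reduce(vcat, [CSV.read(joinpath(dir, f), DataFrame) for f in csv_dir_tmax])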
Initializing the final output to its total size (what vcat() would eventually build) and then populating it element-wise seems to work much better:
using Dates  # for Date(...)

# get the dimensions of each .csv file
tmax_all_total_output_size = fill(0, size(csv_dir_tmax, 1))
tmin_all_total_output_size = fill(0, size(csv_dir_tmin, 1))
tavg_all_total_output_size = fill(0, size(csv_dir_tavg, 1))
tmax_dim = Int64[]
tmin_dim = Int64[]
tavg_dim = Int64[]
for c = 1:length(csv_dir_tmin) # 47484 - last point
    print("Starting csv file ", csv_dir_tmin[c], " - Iteration ", c, "\n")
    if c <= length(csv_dir_tmax)
        tmax_csv = CSV.read(join(["C:/Users/andrew.bannerman/Desktop/Julia/scripts/GHCN data/ghcnd_all_csv/tmax/", csv_dir_tmax[c]]), DataFrame, header = true)
        global tmax_dim = size(tmax_csv, 1)
        tmax_all_total_output_size[c] = tmax_dim
    end
    if c <= length(csv_dir_tmin)
        tmin_csv = CSV.read(join(["C:/Users/andrew.bannerman/Desktop/Julia/scripts/GHCN data/ghcnd_all_csv/tmin/", csv_dir_tmin[c]]), DataFrame, header = true)
        global tmin_dim = size(tmin_csv, 1)
        tmin_all_total_output_size[c] = tmin_dim
    end
    if c <= length(csv_dir_tavg)
        tavg_csv = CSV.read(join(["C:/Users/andrew.bannerman/Desktop/Julia/scripts/GHCN data/ghcnd_all_csv/tavg/", csv_dir_tavg[c]]), DataFrame, header = true)
        global tavg_dim = size(tavg_csv, 1)
        tavg_all_total_output_size[c] = tavg_dim
    end
end

# sum the total dimension of all .csv files
tmax_sum = sum(tmax_all_total_output_size)
tmin_sum = sum(tmin_all_total_output_size)
tavg_sum = sum(tavg_all_total_output_size)

# initialize the final outputs to the total final dimension
tmax_date_array = fill(Date("13000101", "yyyymmdd"), tmax_sum)
tmax_array = zeros(tmax_sum)
tmin_date_array = fill(Date("13000101", "yyyymmdd"), tmin_sum)
tmin_array = zeros(tmin_sum)
tavg_date_array = fill(Date("13000101", "yyyymmdd"), tavg_sum)
tavg_array = zeros(tavg_sum)

# initialize outputs
tmax_all = DataFrame(Date = tmax_date_array, TMAX = tmax_array)
tmin_all = DataFrame(Date = tmin_date_array, TMIN = tmin_array)
tavg_all = DataFrame(Date = tavg_date_array, TAVG = tavg_array)

# running row offsets used while filling the initialized DataFrames
tmax_count = 0
tmin_count = 0
tavg_count = 0
Then begin filling the initialized DataFrames, as sketched below.
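A hedged sketch of what that filling loop could look like for the tmax files (the directory string is the one used above; tmin and tavg would follow the same pattern, and the Date column is assumed to parse directly as a Dates.Date):
tmax_dir = "C:/Users/andrew.bannerman/Desktop/Julia/scripts/GHCN data/ghcnd_all_csv/tmax"
for c = 1:length(csv_dir_tmax)
    csv_tmax = CSV.read(joinpath(tmax_dir, csv_dir_tmax[c]), DataFrame, header = true)
    n = size(csv_tmax, 1)
    # write this file's rows into the pre-allocated block
    tmax_all.Date[tmax_count+1:tmax_count+n] = csv_tmax.Date   # assumes a Date-typed column
    tmax_all.TMAX[tmax_count+1:tmax_count+n] = csv_tmax.TMAX
    global tmax_count += n
end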

Efficient way to expand a DataFrame in Julia

I have a dataframe with exposure episodes per case:
using DataFrames
using Dates
df = DataFrame(id = [1, 1, 2, 3],
               startdate = [Date(2018,3,1), Date(2019,4,2), Date(2018,6,4), Date(2018,5,1)],
               enddate = [Date(2019,4,4), Date(2019,8,5), Date(2019,3,1), Date(2019,4,15)])
I want to expand each episode to its constituent days, eliminating any duplicate days per case resulting from overlapping episodes (case 1 in the example dataframe):
s = similar(df, 0)
for row in eachrow(df)
    tf = DataFrame(row)
    ttf = repeat(tf, Dates.value.(row.enddate - row.startdate) + 1)
    ttf.daydate = ttf.startdate .+ Dates.Day.(0:nrow(ttf) - 1) # a record for each day between start and end dates (inclusive)
    ttf.start = ttf.daydate .== ttf.startdate                  # flag: this record is at the start of an episode
    ttf.end = ttf.daydate .== ttf.enddate                      # flag: this record is at the end of an episode
    append!(s, ttf, cols = :union)
end
sort!(s, [:id, :daydate, :startdate, order(:enddate, rev = true)])
unique!(s, [:id, :daydate]) # eliminate duplicate dates where episodes overlap (e.g. case 1)
I have a strong suspicion that there is a more efficient way of doing this than the brute-force method I came up with, and any help will be appreciated.
Implementation note: in the actual implementation there are several hundred thousand cases, each with relatively few episodes (median = 1, 75th percentile = 3, maximum = 20), but spanning 20 years or more of exposure, resulting in a very large dataset (several hundred million records). To fit into available memory I have partitioned the dataset on id and used the Threads.@threads macro to loop through the partitions in parallel (see the sketch after this note). The primary purpose of this decomposition into days is not just to eliminate overlaps, but to join the data with other exposure data that is available on a per-day basis.
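A minimal sketch of that partition-and-thread pattern (expand_days here is a placeholder for whichever expansion function is used):
id_chunks = collect(Iterators.partition(unique(df.id), 10_000))  # partition the cases by id
results = Vector{DataFrame}(undef, length(id_chunks))
Threads.@threads for k in eachindex(id_chunks)
    ids = Set(id_chunks[k])
    sub = df[in.(df.id, Ref(ids)), :]     # rows belonging to this partition only
    results[k] = expand_days(sub)         # placeholder: per-partition day expansion
end
expanded = reduce(vcat, results)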
Below is a more complete solution that takes some essential details into account. Each episode is associated with additional attributes; as an example I used locationid (the place where the exposure took place) and the need to indicate whether there was a gap between subsequent episodes. The original solution also did not cater for the special case where an episode is fully contained within another episode: such episodes should not be expanded.
using Dates
using DataFrames

function process(startdate, enddate, locationid)
    start = startdate[1]
    stop = enddate[1]
    location = locationid[1]
    res_daydate = collect(start:Day(1):stop)
    res_startdate = fill(start, length(res_daydate))
    res_enddate = fill(stop, length(res_daydate))
    res_location = fill(location, length(res_daydate))
    gap = 0
    res_gap = fill(0, length(res_daydate))
    for i in 2:length(startdate)
        if startdate[i] > res_daydate[end]
            start = startdate[i]
        elseif enddate[i] > res_daydate[end]
            start = res_daydate[end] + Day(1)
        else
            continue # this episode is contained within the previous episode
        end
        if start - res_daydate[end] > Day(1)
            gap = gap == 0 ? 1 : 0
        end
        stop = enddate[i]
        location = locationid[i]
        new_daydate = start:Day(1):stop
        append!(res_daydate, new_daydate)
        append!(res_startdate, fill(startdate[i], length(new_daydate)))
        append!(res_enddate, fill(stop, length(new_daydate)))
        append!(res_location, fill(location, length(new_daydate)))
        append!(res_gap, fill(gap, length(new_daydate)))
    end
    return (daydate = res_daydate, startdate = res_startdate, enddate = res_enddate, locationid = res_location, gap = res_gap)
end

function eliminateoverlap()
    df = DataFrame(id = [1, 1, 2, 3, 3, 4, 4],
                   startdate = [Date(2018,3,1), Date(2019,4,2), Date(2018,6,4), Date(2018,5,1), Date(2019,5,1), Date(2012,1,1), Date(2012,2,2)],
                   enddate = [Date(2019,4,4), Date(2019,8,5), Date(2019,3,1), Date(2019,4,15), Date(2019,6,15), Date(2012,6,30), Date(2012,2,10)],
                   locationid = [10, 11, 21, 30, 30, 40, 41])
    dfs = sort(df, [:startdate, order(:enddate, rev = true)])
    gdf = groupby(dfs, :id, sort = true)
    r = combine(gdf, [:startdate, :enddate, :locationid] => process => AsTable)
    df = combine(groupby(r, [:id, :gap, :locationid]),
                 :daydate => minimum => :StartDate,
                 :daydate => maximum => :EndDate)
    return df
end

df = eliminateoverlap()
Here is something that should be efficient:
dfs = sort(df, [:startdate, order(:enddate, rev = true)])
gdf = groupby(dfs, :id, sort = true)

function process(startdate, enddate)
    start = startdate[1]
    stop = enddate[1]
    res_daydate = collect(start:Day(1):stop)
    res_startdate = fill(start, length(res_daydate))
    res_enddate = fill(stop, length(res_daydate))
    for i in 2:length(startdate)
        if startdate[i] > res_daydate[end]
            start = startdate[i]
            stop = enddate[i]
        elseif enddate[i] > res_daydate[end]
            start = res_daydate[end] + Day(1)
            stop = enddate[i]
        end
        new_daydate = start:Day(1):stop
        append!(res_daydate, new_daydate)
        append!(res_startdate, fill(startdate[i], length(new_daydate)))
        append!(res_enddate, fill(stop, length(new_daydate)))
    end
    return (startdate = res_startdate, enddate = res_enddate, daydate = res_daydate)
end

combine(gdf, [:startdate, :enddate] => process => AsTable)
(But please check it against your implementation on larger data to confirm it is correct, as I have written it quickly just to show how to do performant implementations with DataFrames.jl.)

How to extract stat_smooth curve maxima in a ggplot panel (facet_grid)?

I have created this plot with 18 panels using facet_grid() and two different fitting equations (one for Jan-Apr and one for May-Jun). There are two things I need help with:
(It may sound obvious, but) I haven't been able to find working code online to extract the maximum of a stat_smooth curve. I'd appreciate it if someone could show and explain what the code means. This is the closest I could find, but I am not sure what it means:
gb <- ggplot_build(p1)
curve_max <- gb$data[[1]]$x[which(diff(sign(diff(gb$data[[1]]$y)))==-2)+1]
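For reference, here is my reading of that snippet, re-spelled with comments: ggplot_build() returns the data that ggplot computed for each layer, so gb$data[[1]]$y holds the y values of the first layer (for a stat_smooth layer, the fitted curve). diff(sign(diff(y))) equals -2 exactly where the slope changes from positive to negative, i.e. at a local maximum, and the + 1 compensates for the index shift introduced by diff().
gb <- ggplot_build(p1)                        # data ggplot computed for each layer of p1
y  <- gb$data[[1]]$y                          # y values of the first layer
is_max <- diff(sign(diff(y))) == -2           # TRUE where the slope flips from + to -
curve_max <- gb$data[[1]]$x[which(is_max) + 1]  # x positions of those local maxima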
How to add a vertical line to indicate max value on each curve?
Data file (loaded as rlc2 via read_excel):
Plot code:
plot <- ggplot(rlc2, aes(par, etr, color = month, group = site)) +
  geom_point() +
  stat_smooth(data = subset(rlc2, rlc2$month != "May" & rlc2$month != "Jun"),
              method = "glm",
              formula = y ~ x + log(x),
              se = FALSE,
              method.args = list(family = gaussian(link = "log"), start = c(a = 0, b = 0, c = 0))) +
  stat_smooth(data = subset(rlc2, rlc2$month == "May" | rlc2$month == "Jun"),
              method = "nlsLM",  # nlsLM() comes from the minpack.lm package
              formula = y ~ M * (1 - exp(-(a * x))),
              se = FALSE,
              method.args = list(start = c(M = 0, a = 10))) +
  facet_grid(rows = vars(month), cols = vars(site))
plot
(figure: field_rlc_plot, the resulting faceted plot)
Any other advice is also welcome. I am not trained as a programmer, so my code is probably a bit messy. Thank you for helping.
Try this:
First, fit the data and extract the maximum of the fit.
my.fit <- function(month, site, data) {
  fit <- glm(formula = etr ~ par + log(par),
             data = data,
             family = gaussian(link = "log"))
  # arrange the desired output in a tibble
  tibble(max = max(fit$fitted.values),
         site = site,
         month = month)
}

# Apply the custom function `my.fit` to each subset of the data
# (grouped by month and site) using the group_by/nest/pmap method;
# the results are row-bound and returned as a data frame.
my.max <-
  rlc2 %>%
  dplyr::group_by(month, site) %>%
  tidyr::nest() %>%
  purrr::pmap_dfr(my.fit)
Next, join the results back onto your data and plot them with geom_line():
rlc2 %>%
  dplyr::left_join(my.max) %>%
  ggplot(aes(x = par, y = etr)) +
  geom_point() +
  stat_smooth(data = subset(rlc2, rlc2$month != "May" & rlc2$month != "Jun"),
              method = "glm",
              formula = y ~ x + log(x),
              se = FALSE,
              method.args = list(family = gaussian(link = "log"), start = c(a = 0, b = 0, c = 0))) +
  stat_smooth(data = subset(rlc2, rlc2$month == "May" | rlc2$month == "Jun"),
              method = "nlsLM",
              formula = y ~ M * (1 - exp(-(a * x))),
              se = FALSE,
              method.args = list(start = c(M = 0, a = 10))) +
  geom_line(aes(y = max), col = "red") +
  facet_grid(rows = vars(month), cols = vars(site))
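For the second part of the question (marking where each fitted curve peaks with a vertical line), one possible sketch is to pull the smoothed curves out of ggplot_build() and add a geom_vline() per facet. The layer indices and column names below are assumptions based on the plot defined in the question (layer 1 = points, layers 2-3 = the two smooths), so treat this as a starting point rather than a finished solution:
gb <- ggplot_build(plot)
smooths <- do.call(rbind, gb$data[2:3])          # data of the two stat_smooth layers
maxima <- do.call(rbind, lapply(split(smooths, smooths$PANEL),
                                function(d) d[which.max(d$y), c("PANEL", "x", "y")]))
# map each panel back to its month/site so the lines land in the right facets
maxima <- merge(maxima, gb$layout$layout[, c("PANEL", "month", "site")], by = "PANEL")
plot + geom_vline(data = maxima, aes(xintercept = x), linetype = "dashed")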

QUANTSTRAT - apply.paramset issue

I am trying to optimize the MACD parameters for a trading strategy, but unfortunately I am stuck on the paramset.label value. This is the code:
################################# MACD PARAMETERS OPTIMIZATION
.fastMA <- (20:40)
.slowMA <- (30:70)
.nsamples <- 10
strat.st <- 'volStrat'

# Paramset
add.distribution(strat.st,
                 paramset.label = 'EMA',
                 component.type = 'indicator',
                 component.label = 'macd.out',
                 variable = list(n = .fastMA),
                 label = 'nFast')

add.distribution(strat.st,
                 paramset.label = 'EMA',
                 component.type = 'indicator',
                 component.label = 'macd.out',
                 variable = list(n = .slowMA),
                 label = 'nSlow')

add.distribution.constraint(strat.st,
                            paramset.label = 'EMA',
                            distribution.label.1 = 'nFast',
                            distribution.label.2 = 'nSlow',
                            operator = '<',
                            label = 'nFast<nSlow')

results <- apply.paramset(strat.st,
                          paramset.label = 'EMA',
                          portfolio = portfolio2.st,
                          account = account.st,
                          nsamples = .nsamples,
                          verbose = TRUE)

stats <- results$tradeStats
print(stats)
When I run it, this error occurs for every sample:
evaluation # 1:
$param.combo
nFast nSlow
379 23 51
[1] "Processing param.combo 379"
nFast nSlow
379 23 51
result of evaluating expression:
<simpleError in strategy[[components.type]][[index]]: subscript out of bounds>
got results for task 1
numValues: 1, numResults: 1, stopped: FALSE
returning status FALSE
And then, for the last one, this is the error:
evaluation # 10:
$param.combo
nFast nSlow
585 40 60
[1] "Processing param.combo 585"
nFast nSlow
585 40 60
result of evaluating expression:
<simpleError in strategy[[components.type]][[index]]: subscript out of bounds>
got results for task 10
numValues: 10, numResults: 10, stopped: FALSE
first call to combine function
evaluating call object to combine results:
fun(result.1, result.2, result.3, result.4, result.5, result.6,
result.7, result.8, result.9, result.10)
error calling combine function:
<simpleError in fun(result.1, result.2, result.3, result.4, result.5, result.6, result.7, result.8, result.9, result.10): attempt to select less than one element>
numValues: 10, numResults: 10, stopped: TRUE
I really don't understand how to fix this.
Can anyone tell me how to solve it?
Thank you so much.
You didn't share the code before the optimization part, so this is only a guess at the direction to look in.
I understand you want to test 20:40 and 30:70, but in your optimization code you add two distributions that both point to component.label = 'macd.out'.
I did a similar test: although both distributions use MA-type indicators, they generally should not point to the same indicator output (component.label = 'macd.out'). My code worked when one distribution pointed to component.label = 'fast' and the other to component.label = 'slow', so that they referred to different data and could be compared.
You can try debugging in this direction, for example along the lines of the sketch below.
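For illustration only, a sketch of that layout might look like the following (the EMA indicators, their labels and the Cl(mktdata) column are assumptions, not taken from your strategy code):
# two separately labelled indicators, so each distribution varies its own n
add.indicator(strat.st, name = "EMA",
              arguments = list(x = quote(Cl(mktdata)), n = 20),
              label = "fast")
add.indicator(strat.st, name = "EMA",
              arguments = list(x = quote(Cl(mktdata)), n = 30),
              label = "slow")

add.distribution(strat.st,
                 paramset.label = 'EMA',
                 component.type = 'indicator',
                 component.label = 'fast',   # points at the fast EMA only
                 variable = list(n = .fastMA),
                 label = 'nFast')
add.distribution(strat.st,
                 paramset.label = 'EMA',
                 component.type = 'indicator',
                 component.label = 'slow',   # points at the slow EMA only
                 variable = list(n = .slowMA),
                 label = 'nSlow')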