I am working in R and have a variable containing '2 month 3 day 6 hour 70 minute' as a string. The variable changes over time and therefore does not always have the same length/structure. I need this variable to run a query against a PostgreSQL database by casting it to an interval, and that part works just fine.
Now I need this interval/string-variable as integer in minutes to do some mathematical calculations.
I thought of using sqldf as follows:
library(sqldf)
my_interval = '2 month 3 day 6 hour 70 minute'
interval_minutes <- sqldf(paste("SELECT EXTRACT(EPOCH FROM '",my_interval,"'::INTERVAL)/60"))
interval_minutes_novar <- sqldf("SELECT EXTRACT(EPOCH FROM '2 month 3 day 6 hour 70 minute'::INTERVAL)/60")
but I am getting Error: near "FROM": syntax error. From my research I know that sqldf uses SQLite by default, which does not support EXTRACT().
How can I convert a SQL-Interval to minutes using R?
1) sqldf/gsubfn Using gsubfn, replace each unit word in my_interval with *, the appropriate number of minutes, and +. Remove any trailing + and spaces and then either parse and evaluate mins or substitute mins into the SQL statement. There are 365.25 / 12 days in the average month (averaged over 4 calendar years, one of which is a leap year), but if you want to get the same answer as PostgreSQL, replace 365.25 / 12 with 30, as noted in the comments.
library(sqldf) # this also pulls in gsubfn
# input
my_interval = '2 month 3 day 6 hour 70 minute'
L <- list(minute = " +", hour = "*60 +", day = "*60*24 +",
month = "*365.25 * 60 * 24 /12 +")
mins <- my_interval |>
gsubfn(pattern = "\\w+", replacement = L) |>
trimws(whitespace = "[+ ]")
eval(parse(text = mins))
## [1] 92410
fn$sqldf("select $mins mins")
## mins
## 1 92410
2) Base R This is a base R solution. Extract the numbers and words into separate vectors, translate the words into the appropriate conversion factors and take their inner product. The discussion about 30-day months in (1) applies here too.
v <- c(minute = 1, hour = 60, day = 60 * 24, month = 365.25 * 60 * 24 /12)
nums <- my_interval |>
gsub(pattern = "[a-z]", replacement = "") |>
textConnection() |>
scan(quiet = TRUE)
words <- my_interval |>
gsub(pattern = "\\d", replacement = "") |>
textConnection() |>
scan(what = "", quiet = TRUE)
sum(v[words] * nums)
## [1] 92410
3) lubridate lubridate duration objects can be used.
library(lubridate)
as.numeric(duration(my_interval), "minute")
## [1] 92410
Although lubridate does not handle 30-day months (and Hadley says support is not planned), we can preprocess my_interval to get the same effect.
library(gsubfn)
library(lubridate)
my_interval |>
gsubfn(pattern = "(\\d+) +month", replacement = ~paste(30*as.numeric(x),"day")) |>
duration() |>
as.numeric("minute")
## [1] 91150
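For reference, the 91150 figure above also follows from simple arithmetic under the same 30-day-month assumption:
# 2 months of 30 days, plus 3 days, 6 hours and 70 minutes, all expressed in minutes
2 * 30 * 24 * 60 + 3 * 24 * 60 + 6 * 60 + 70
## [1] 91150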
Adapting my answer here to this question, I'll restate a rather gaping problem with this conversion: converting "month" into "seconds" is not a fixed operation, as months vary between 28 and 31 days. If we assume 30 days, though, for the sake of argument, then:
func <- function(x, ptn) {
out <- gsub(paste0(".*?\\b([0-9.]+)\\s*", ptn, ".*"), "\\1", x, ignore.case = TRUE)
ifelse(out == x, NA, out)
}
res1 <- lapply(c(mon = "month", day = "day", hr = "hour", min = "minute"),
function(ptn) as.numeric(func(my_interval, ptn)))
res2 <- lapply(res1, function(z) ifelse(is.na(z), 0, z))
res2
# $mon
# [1] 2
# $day
# [1] 3
# $hr
# [1] 6
# $min
# [1] 70
86400 * (res2$mon*30 + res2$day) + 3600*res2$hr + 60*res2$min
# [1] 5469000
Because I'm using lapply and simple vectorizable operations here, this also works if my_interval is more than one string (of similar format). It is robust to missing variables (presumed 0), and can include "year" (albeit with leap-year inaccuracies) and/or "second" if desired.
intervals <- c("2 month 3 day 6 hour 70 minute", "1 year", "1 hour 1 second")
res1 <- lapply(c(yr = "year", mon = "month", day = "day", hr = "hour", min = "minute", sec = "second"),
function(ptn) as.numeric(func(intervals, ptn)))
res2 <- lapply(res1, function(z) ifelse(is.na(z), 0, z))
str(res2)
# List of 6
# $ yr : num [1:3] 0 1 0
# $ mon: num [1:3] 2 0 0
# $ day: num [1:3] 3 0 0
# $ hr : num [1:3] 6 0 1
# $ min: num [1:3] 70 0 0
# $ sec: num [1:3] 0 0 1
86400 * (res2$yr*365 + res2$mon*30 + res2$day) + 3600*res2$hr + 60*res2$min + res2$sec
# [1] 5469000 31536000 3601
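Since the question ultimately wants minutes, the seconds total can simply be divided by 60 (using the res2 list built above):
secs <- 86400 * (res2$yr*365 + res2$mon*30 + res2$day) + 3600*res2$hr + 60*res2$min + res2$sec
secs / 60
## roughly 91150, 525600 and 60 minutes for the three example intervals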
My workaround is to use my PostgreSQL connection to do it:
library(sf)
library(RPostgres)
my_postgresql_connection <- dbConnect(Postgres(), dbname = "my_db", host = "my_host", port = 1234, user = "my_user", password = "my_password")
my_interval = '2 month 3 day 6 hour 70 minute'
my_dataframe <- st_read(my_postgresql_connection, query = paste("SELECT EXTRACT(EPOCH FROM '",my_interval,"'::INTERVAL)/60 as minutes"))
my_interval_in_minutes <- as.double(my_dataframe$minutes[1])
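A small refinement of this workaround (a sketch, assuming the same connection object as above): pass the interval as a bound parameter with DBI::dbGetQuery instead of pasting it into the SQL string, which avoids quoting problems:
library(DBI)
# $1 is a PostgreSQL placeholder; RPostgres sends the value separately,
# so the interval string never has to be pasted into the query text
interval_minutes <- dbGetQuery(
  my_postgresql_connection,
  "SELECT EXTRACT(EPOCH FROM $1::INTERVAL) / 60 AS minutes",
  params = list(my_interval)
)$minutes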
This is my code:
p14 <- ggplot(plot14, aes(x = Harvest, y = Percentage, fill = factor(Plant, level = orderplants)))+
geom_col(show.legend = FALSE)+
geom_vline(xintercept=3.5)+
labs(y = "Bedekking %",
x = NULL,
fill = "Plantensoort")+
theme_classic()
plot 14
The code is about plant coverage of a plot (I have 70 plots in total). So bedekking is the Dutch word for coverage. The problem: the numbers represent the time periods of the measurements:
July 2020
August 2020
October 2020
May 2021
June 2021
July 2021
August 2021
October 2021
I would like the bars of each month to line up, so there would be two rows (2020 and 2021) where the bars of the same months are above/below each other (see ugly sketch below). Is this possible to code, or do I need to change my entire dataset?
very quick example of goal
It would be better if you could include some sample raw data as part of a reproducible example. I've created a little made up data to illustrate.
Ideally, you'd want the raw dates behind the numbered time measurements so that you can split the dates into months and years as separate variables. Assuming you only have the numbers, you could create some logic like this to make the months and years.
And you could use facet_wrap to show one year above another for the respective months.
library(tidyverse)
library(scales)
tibble(
harvest = seq(1, 8, 1),
percentage = rep(1, 8),
plant = rep(c("this", "that"), 4)
) |>
mutate(
month = case_when(
harvest %in% c(1, 6) ~ 7,
harvest %in% c(2, 7) ~ 8,
harvest %in% c(3, 8) ~ 10,
harvest %in% c(4) ~ 5,
harvest %in% c(5) ~ 6,
TRUE ~ NA_real_
),
year = case_when(
harvest <= 3 ~ 2020,
TRUE ~ 2021
)
) |>
ggplot(aes(month, percentage, fill = plant)) +
geom_col(show.legend = FALSE) +
labs(
y = "Bedekking %",
x = NULL,
fill = "Plantensoort"
) +
facet_wrap(~year, ncol = 1) +
scale_y_continuous(labels = label_percent()) +
theme_classic()
Created on 2022-05-13 by the reprex package (v2.0.1)
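As an optional follow-up, the numeric month breaks could be given readable labels with base R's built-in month.abb vector, for example by adding scale_x_continuous(breaks = c(5:8, 10), labels = month.abb[c(5:8, 10)]) to the plot above; a quick look at those labels:
month.abb[c(5, 6, 7, 8, 10)]
## [1] "May" "Jun" "Jul" "Aug" "Oct"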
In Norway we have something called D- and S-numbers. These are national identification numbers where the day or month of birth is modified.
D-number
[d+4]dmmyy
S-number
dd[m+5]myy
I have a column with dates, some of them normal (ddmmyy) and some of them are formatted as D- or S-numbers. Leading zeroes are also missing.
import pandas as pd

df = pd.DataFrame({'dates': [241290,  # 24.12.90
                             710586,  # 31.05.86
                             105299,  # 10.02.99
                             56187]   # 05.11.87
                   })
dates
0 241290
1 710586
2 105299
3 56187
I've written this function to add a leading zero and convert the dates, but this solution doesn't feel that great.
def func(s):
    s = s.astype(str)
    res = []
    for index, value in s.items():
        # Make sure all dates have 6 digits (add leading zero)
        if len(value) == 5:
            value = '0' + value
        # Convert S- and D-dates to regular dates
        if int(value[0]) > 3:
            # subtract 4 from the first digit
            res.append(str(int(value[0]) - 4) + value[1:])
        elif int(value[2]) > 1:
            # subtract 5 from the third digit
            res.append(value[:2] + str(int(value[2]) - 5) + value[3:])
        else:
            res.append(value)
    return pd.Series(res)
Is there a smoother and faster way of accomplishing the same result?
Normalize the dates by padding with 0, then explode them into 3 columns of two digits (day, month, year). Apply your rules and combine the columns with pd.to_datetime:
# Suggested by #HenryEcker
# Changed: .pad(6, fillchar='0') to .zfill(6)
dates = df['dates'].astype(str).str.zfill(6).str.findall('(\d{2})') \
.apply(pd.Series).astype(int) \
.rename(columns={0: 'day', 1: 'month', 2: 'year'}) \
.agg({'day': lambda d: d if d <= 31 else d - 40,
'month': lambda m: m if m <= 12 else m - 50,
'year': lambda y: 1900 + y})
df['dates2'] = pd.to_datetime(dates)
Output:
>>> df
dates dates2
0 241290 1990-12-24
1 710586 1986-05-31
2 105299 1999-02-10
3 56187 1987-11-05
>>> dates
day month year
0 24 12 1990
1 31 5 1986
2 10 2 1999
3 5 11 1987
You can keep the Series as integers until the final step. The disadvantage of the method below is that the offsets do not match what the specifications say and may take more mental power to comprehend:
import numpy as np

def func2(s):
    # In mathematical operations, digits are counted from the right,
    # so the "first digit" becomes the sixth and the "third digit" becomes
    # the fourth in a 6-digit number.
    delta = np.select(
        [s // 10**5 % 10 > 3, s // 10**3 % 10 > 1],
        [4 * 10**5, 5 * 10**3],
        0
    )
    return (s - delta).astype('str').str.pad(6, fillchar='0')
Consider the following data.tables. The first defines a set of regions with start and end positions for each group 'x':
library(data.table)
d1 <- data.table(x = letters[1:5], start = c(1,5,19,30, 7), end = c(3,11,22,39,25))
setkey(d1, x, start)
# x start end
# 1: a 1 3
# 2: b 5 11
# 3: c 19 22
# 4: d 30 39
# 5: e 7 25
The second data set has the same grouping variable 'x', and positions 'pos' within each group:
d2 <- data.table(x = letters[c(1,1,2,2,3:5)], pos = c(2,3,3,12,20,52,10))
setkey(d2, x, pos)
# x pos
# 1: a 2
# 2: a 3
# 3: b 3
# 4: b 12
# 5: c 20
# 6: d 52
# 7: e 10
Ultimately I'd like to extract the rows in 'd2' where 'pos' falls within the range defined by 'start' and 'end', within each group x. The desired result is
# x pos start end
# 1: a 2 1 3
# 2: a 3 1 3
# 3: c 20 19 22
# 4: e 10 7 25
The start/end positions for any group x will never overlap but there may be gaps of values not in any region.
Now, I believe I should be using a rolling join. From what I can tell, I cannot use the "end" column in the join.
I've tried
d1[d2, roll = TRUE, nomatch = 0, mult = "all"][start <= end]
and got
# x start end
# 1: a 2 3
# 2: a 3 3
# 3: c 20 22
# 4: e 10 25
which is the right set of rows I want; however, "pos" has become "start" and the original "start" has been lost. Is there a way to preserve all the columns with the rolling join so I could report "start", "pos", and "end" as desired?
Overlap joins were implemented with commit 1375 in data.table v1.9.3, and are available in the current stable release, v1.9.4. The function is called foverlaps. From NEWS:
29) Overlap joins #528 is now here, finally!! Except for type="equal" and maxgap and minoverlap arguments, everything else is implemented. Check out ?foverlaps and the examples there on its usage. This is a major feature addition to data.table.
Let's consider x, an interval defined as [a, b], where a <= b, and y, another interval defined as [c, d], where c <= d. The interval y is said to overlap x at all iff d >= a and c <= b. And y is entirely contained within x iff a <= c and d <= b. For the different types of overlaps implemented, please have a look at ?foverlaps.
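As a concrete check of these conditions with the data above (the group "c" interval [19, 22] from d1, and the position 20 from d2 treated as the zero-width interval [20, 20]):
a <- 19; b <- 22    # [a, b]: start/end of the group "c" interval in d1
cc <- 20; dd <- 20  # [c, d]: pos from d2, viewed as a zero-width interval
dd >= a & cc <= b   # overlaps at all: TRUE
a <= cc & dd <= b   # entirely contained ("within"): TRUE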
Your question is a special case of an overlap join: in d1 you have true physical intervals with start and end positions. In d2, on the other hand, there are only positions (pos), not intervals. To be able to do an overlap join, we need to create intervals in d2 as well. This is achieved by creating an additional variable pos2, which is identical to pos (d2[, pos2 := pos]). Thus, we now have an interval in d2, albeit with identical start and end coordinates. This 'virtual, zero-width interval' in d2 can then be used in foverlaps to do an overlap join with d1:
require(data.table) ## 1.9.3
setkey(d1)
d2[, pos2 := pos]
foverlaps(d2, d1, by.x = names(d2), type = "within", mult = "all", nomatch = 0L)
# x start end pos pos2
# 1: a 1 3 2 2
# 2: a 1 3 3 3
# 3: c 19 22 20 20
# 4: e 7 25 10 10
by.y by default is key(y), so we skipped it. by.x by default takes key(x) if it exists, and if not takes key(y). But a key doesn't exist for d2, and we can't set the columns from y, because they don't have the same names. So, we set by.x explicitly.
The type of overlap is within, and we'd like to have all matches, only if there is a match.
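Spelled out explicitly, the call above is equivalent to the following sketch, with by.y listing the key of d1 (set by setkey(d1)) and by.x listing the columns of d2:
foverlaps(d2, d1,
          by.x = c("x", "pos", "pos2"),
          by.y = c("x", "start", "end"),
          type = "within", mult = "all", nomatch = 0L)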
NB: foverlaps uses data.table's binary search feature (along with roll where necessary) under the hood, but some function arguments (types of overlaps, maxgap, minoverlap etc..) are inspired by the function findOverlaps() from the Bioconductor package IRanges, an excellent package (and so is GenomicRanges, which extends IRanges for Genomics).
So what's the advantage?
Benchmarking the code above on your data shows foverlaps() slower than Gabor's answer (timings: Gabor's data.table solution = 0.004 vs foverlaps = 0.021 seconds). But does it really matter at this granularity?
What would be really interesting is to see how well it scales, in terms of both speed and memory. In Gabor's answer, we join based on the key column x and then filter the results.
What if d1 has about 40K rows and d2 has 100K rows (or more)? For each row in d2 that matches x in d1, all those rows will be matched and returned, only to be filtered later. Here's an example of your Q scaled only slightly:
Generate data:
require(data.table)
set.seed(1L)
n = 20e3L; k = 100e3L
idx1 = sample(100, n, TRUE)
idx2 = sample(100, n, TRUE)
d1 = data.table(x = sample(letters[1:5], n, TRUE),
start = pmin(idx1, idx2),
end = pmax(idx1, idx2))
d2 = data.table(x = sample(letters[1:15], k, TRUE),
pos1 = sample(60:150, k, TRUE))
foverlaps:
system.time({
setkey(d1)
d2[, pos2 := pos1]
ans1 = foverlaps(d2, d1, by.x=1:3, type="within", nomatch=0L)
})
# user system elapsed
# 3.028 0.635 3.745
This took ~ 1GB of memory in total, out of which ans1 is 420MB. Most of the time spent here is on subset really. You can check it by setting the argument verbose=TRUE.
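For instance, a sketch re-running the same call with verbose output (verbose is an existing foverlaps argument):
ans1 <- foverlaps(d2, d1, by.x = 1:3, type = "within", nomatch = 0L, verbose = TRUE)
# prints diagnostic messages from foverlaps while the result is computed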
Gabor's solutions:
## new session - data.table solution
system.time({
setkey(d1, x)
ans2 <- d1[d2, allow.cartesian=TRUE, nomatch=0L][between(pos1, start, end)]
})
# user system elapsed
# 15.714 4.424 20.324
And this took a total of ~3.5GB.
I just noted that Gabor already mentions the memory required for intermediate results. So, trying out sqldf:
# new session - sqldf solution
system.time(ans3 <- sqldf("select * from d1 join
d2 using (x) where pos1 between start and end"))
# user system elapsed
# 73.955 1.605 77.049
Took a total of ~1.4GB. So, it definitely uses less memory than the one shown above.
[The answers were verified to be identical after removing pos2 from ans1 and setting key on both answers.]
Note that this overlap join is designed for problems where d2 doesn't necessarily have identical start and end coordinates (e.g. genomics, the field where I come from, where d2 is usually about 30-150 million or more rows).
foverlaps() is stable, but is still under development, meaning some arguments and names might get changed.
NB: Since I mentioned GenomicRanges above, it is also perfectly capable of solving this problem. It uses interval trees under the hood, and is quite memory efficient as well. In my benchmarks on genomics data, foverlaps() is faster. But that's for another (blog) post, some other time.
data.table v1.9.8+ has a new feature - non-equi joins. With that, this operation becomes even more straightforward:
require(data.table) #v1.9.8+
# no need to set keys on `d1` or `d2`
d2[d1, .(x, pos=x.pos, start, end), on=.(x, pos>=start, pos<=end), nomatch=0L]
# x pos start end
# 1: a 2 1 3
# 2: a 3 1 3
# 3: c 20 19 22
# 4: e 10 7 25
1) sqldf This is not data.table, but complex join criteria are easy to specify in a straightforward manner in SQL:
library(sqldf)
sqldf("select * from d1 join d2 using (x) where pos between start and end")
giving:
x start end pos
1 a 1 3 2
2 a 1 3 3
3 c 19 22 20
4 e 7 25 10
2) data.table For a data.table answer try this:
library(data.table)
setkey(d1, x)
setkey(d2, x)
d1[d2][between(pos, start, end)]
giving:
x start end pos
1: a 1 3 2
2: a 1 3 3
3: c 19 22 20
4: e 7 25 10
Note that this does have the disadvantage of forming the possibly large intermediate result d1[d2], which SQL may not do. The remaining solutions may have this problem too.
3) dplyr This suggests the corresponding dplyr solution. We also use between from data.table:
library(dplyr)
library(data.table) # between
d1 %>%
inner_join(d2) %>%
filter(between(pos, start, end))
giving:
Joining by: "x"
x start end pos
1 a 1 3 2
2 a 1 3 3
3 c 19 22 20
4 e 7 25 10
4) merge/subset Using only the base of R:
subset(merge(d1, d2), start <= pos & pos <= end)
giving:
x start end pos
1: a 1 3 2
2: a 1 3 3
3: c 19 22 20
4: e 7 25 10
Added: Note that the data.table solution here is much faster than the one in the other answer:
dt1 <- function() {
d1 <- data.table(x=letters[1:5], start=c(1,5,19,30, 7), end=c(3,11,22,39,25))
d2 <- data.table(x=letters[c(1,1,2,2,3:5)], pos=c(2,3,3,12,20,52,10))
setkey(d1, x, start)
idx1 = d1[d2, which=TRUE, roll=Inf] # last observation carried forwards
setkey(d1, x, end)
idx2 = d1[d2, which=TRUE, roll=-Inf] # next observation carried backwards
idx = which(!is.na(idx1) & !is.na(idx2))
ans1 <<- cbind(d1[idx1[idx]], d2[idx, list(pos)])
}
dt2 <- function() {
d1 <- data.table(x=letters[1:5], start=c(1,5,19,30, 7), end=c(3,11,22,39,25))
d2 <- data.table(x=letters[c(1,1,2,2,3:5)], pos=c(2,3,3,12,20,52,10))
setkey(d1, x)
ans2 <<- d1[d2][between(pos, start, end)]
}
all.equal(as.data.frame(ans1), as.data.frame(ans2))
## TRUE
benchmark(dt1(), dt2())[1:4]
## test replications elapsed relative
## 1 dt1() 100 1.45 1.667
## 2 dt2() 100 0.87 1.000 <-- from (2) above
Overlap joins are available in dplyr 1.1.0 via the function join_by.
With join_by, you can do overlap join with between, or manually with >= and <=:
library(dplyr)
inner_join(d2, d1, by = join_by(x, between(pos, start, end)))
# x pos start end
#1 a 2 1 3
#2 a 3 1 3
#3 c 20 19 22
#4 e 10 7 25
inner_join(d2, d1, by = join_by(x, pos >= start, pos <= end))
# x pos start end
#1 a 2 1 3
#2 a 3 1 3
#3 c 20 19 22
#4 e 10 7 25
Using fuzzyjoin:
result <- fuzzyjoin::fuzzy_inner_join(d2, d1,
                                      by = c('x', 'pos' = 'start', 'pos' = 'end'),
                                      match_fun = list(`==`, `>=`, `<=`))
result
# x.x pos x.y start end
# <chr> <dbl> <chr> <dbl> <dbl>
#1 a 2 a 1 3
#2 a 3 a 1 3
#3 c 20 c 19 22
#4 e 10 e 7 25
Since fuzzyjoin returns all the columns we might need to do some cleaning to keep the columns that we want.
library(dplyr)
result %>% select(x = x.x, pos, start, end)
# A tibble: 4 x 4
# x pos start end
# <chr> <dbl> <dbl> <dbl>
#1 a 2 1 3
#2 a 3 1 3
#3 c 20 19 22
#4 e 10 7 25
I want to be able to visualise and analyse data for a specific period of time. The data is in different files, so I read each file and appended them into an array. The data I have is in Julian date format, so I used a function to convert it to datetime in order to plot the data. However, the function is now giving an error:
File "D:\DATA\TEC DATA\2018\amco\RES\CMN_append_df_stack_merge.py", line 162, in jd_to_date
jd = jd + 0.5
TypeError: can only concatenate str (not "float") to str
It was not giving this error before; however, the plot had lines crisscrossing everywhere, which I suspect is because the datetime was not being formatted correctly.
Here is the code I used:
# -*- coding: utf-8 -*-
"""
Created on Mon Mar 9 17:51:41 2020
#author: user
"""
import glob as glob
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from datetime import datetime
"""
Functions for converting dates to/from JD and MJD. Assumes dates are historical
dates, including the transition from the Julian calendar to the Gregorian
calendar in 1582. No support for proleptic Gregorian/Julian calendars.
:Author: Matt Davis
:Website: http://github.com/jiffyclub
"""
import math
import datetime as dt
# Note: The Python datetime module assumes an infinitely valid Gregorian calendar.
# The Gregorian calendar took effect after 10-15-1582 and the dates 10-05 through
# 10-14-1582 never occurred. Python datetime objects will produce incorrect
# time deltas if one date is from before 10-15-1582.
def mjd_to_jd(mjd):
"""
Convert Modified Julian Day to Julian Day.
Parameters
----------
mjd : float
Modified Julian Day
Returns
-------
jd : float
Julian Day
"""
return mjd + 2400000.5
def jd_to_mjd(jd):
"""
Convert Julian Day to Modified Julian Day
Parameters
----------
jd : float
Julian Day
Returns
-------
mjd : float
Modified Julian Day
"""
return jd - 2400000.5
def date_to_jd(year,month,day):
"""
Convert a date to Julian Day.
Algorithm from 'Practical Astronomy with your Calculator or Spreadsheet',
4th ed., Duffet-Smith and Zwart, 2011.
Parameters
----------
year : int
Year as integer. Years preceding 1 A.D. should be 0 or negative.
The year before 1 A.D. is 0, 10 B.C. is year -9.
month : int
Month as integer, Jan = 1, Feb. = 2, etc.
day : float
Day, may contain fractional part.
Returns
-------
jd : float
Julian Day
Examples
--------
Convert 6 a.m., February 17, 1985 to Julian Day
>>> date_to_jd(1985,2,17.25)
2446113.75
"""
if month == 1 or month == 2:
yearp = year - 1
monthp = month + 12
else:
yearp = year
monthp = month
# this checks where we are in relation to October 15, 1582, the beginning
# of the Gregorian calendar.
if ((year < 1582) or
(year == 1582 and month < 10) or
(year == 1582 and month == 10 and day < 15)):
# before start of Gregorian calendar
B = 0
else:
# after start of Gregorian calendar
A = math.trunc(yearp / 100.)
B = 2 - A + math.trunc(A / 4.)
if yearp < 0:
C = math.trunc((365.25 * yearp) - 0.75)
else:
C = math.trunc(365.25 * yearp)
D = math.trunc(30.6001 * (monthp + 1))
jd = B + C + D + day + 1720994.5
return jd
def jd_to_date(jd):
"""
Convert Julian Day to date.
Algorithm from 'Practical Astronomy with your Calculator or Spreadsheet',
4th ed., Duffet-Smith and Zwart, 2011.
Parameters
----------
jd : float
Julian Day
Returns
-------
year : int
Year as integer. Years preceding 1 A.D. should be 0 or negative.
The year before 1 A.D. is 0, 10 B.C. is year -9.
month : int
Month as integer, Jan = 1, Feb. = 2, etc.
day : float
Day, may contain fractional part.
Examples
--------
Convert Julian Day 2446113.75 to year, month, and day.
>>> jd_to_date(2446113.75)
(1985, 2, 17.25)
"""
jd = jd + 0.5
F, I = math.modf(jd)
I = int(I)
A = math.trunc((I - 1867216.25)/36524.25)
if I > 2299160:
B = I + 1 + A - math.trunc(A / 4.)
else:
B = I
C = B + 1524
D = math.trunc((C - 122.1) / 365.25)
E = math.trunc(365.25 * D)
G = math.trunc((C - E) / 30.6001)
day = C - E + F - math.trunc(30.6001 * G)
if G < 13.5:
month = G - 1
else:
month = G - 13
if month > 2.5:
year = D - 4716
else:
year = D - 4715
return year, month, day
def hmsm_to_days(hour=0,min=0,sec=0,micro=0):
"""
Convert hours, minutes, seconds, and microseconds to fractional days.
Parameters
----------
hour : int, optional
Hour number. Defaults to 0.
min : int, optional
Minute number. Defaults to 0.
sec : int, optional
Second number. Defaults to 0.
micro : int, optional
Microsecond number. Defaults to 0.
Returns
-------
days : float
Fractional days.
Examples
--------
>>> hmsm_to_days(hour=6)
0.25
"""
days = sec + (micro / 1.e6)
days = min + (days / 60.)
days = hour + (days / 60.)
return days / 24.
def days_to_hmsm(days):
"""
Convert fractional days to hours, minutes, seconds, and microseconds.
Precision beyond microseconds is rounded to the nearest microsecond.
Parameters
----------
days : float
A fractional number of days. Must be less than 1.
Returns
-------
hour : int
Hour number.
min : int
Minute number.
sec : int
Second number.
micro : int
Microsecond number.
Raises
------
ValueError
If `days` is >= 1.
Examples
--------
>>> days_to_hmsm(0.1)
(2, 24, 0, 0)
"""
hours = days * 24.
hours, hour = math.modf(hours)
mins = hours * 60.
mins, min = math.modf(mins)
secs = mins * 60.
secs, sec = math.modf(secs)
micro = round(secs * 1.e6)
return int(hour), int(min), int(sec), int(micro)
def datetime_to_jd(date):
"""
Convert a `datetime.datetime` object to Julian Day.
Parameters
----------
date : `datetime.datetime` instance
Returns
-------
jd : float
Julian day.
Examples
--------
>>> d = datetime.datetime(1985,2,17,6)
>>> d
datetime.datetime(1985, 2, 17, 6, 0)
>>> jdutil.datetime_to_jd(d)
2446113.75
"""
days = date.day + hmsm_to_days(date.hour,date.minute,date.second,date.microsecond)
return date_to_jd(date.year,date.month,days)
def jd_to_datetime(jd):
"""
Convert a Julian Day to an `jdutil.datetime` object.
Parameters
----------
jd : float
Julian day.
Returns
-------
dt : `jdutil.datetime` object
`jdutil.datetime` equivalent of Julian day.
Examples
--------
>>> jd_to_datetime(2446113.75)
datetime(1985, 2, 17, 6, 0)
"""
year, month, day = jd_to_date(jd)
frac_days,day = math.modf(day)
day = int(day)
hour,min,sec,micro = days_to_hmsm(frac_days)
return datetime(year,month,day,hour,min,sec,micro)
'---------------------------------------------------------------------------------------------------------------------------------------------------------'
#"""
#JULIEN DAY CONVERTOR
def timedelta_to_days(td):
"""
Convert a `datetime.timedelta` object to a total number of days.
Parameters
----------
td : `datetime.timedelta` instance
Returns
-------
days : float
Total number of days in the `datetime.timedelta` object.
Examples
--------
>>> td = datetime.timedelta(4.5)
>>> td
datetime.timedelta(4, 43200)
>>> timedelta_to_days(td)
4.5
"""
seconds_in_day = 24. * 3600.
days = td.days + (td.seconds + (td.microseconds * 10.e6)) / seconds_in_day
return days
"""
'Start of main Code---------------------------------------------------------------------------------------------------'
"""
#########
files = glob.glob('*.Cmn')
files = files[:3]
np_array_values = []
for file in files:
df = pd.read_csv(file, delimiter="\t", sep ='\t', skiprows = 5, names = ["Jdate" ,'Time' ,'PRN' ,'Az','Ele','Lat', 'Lon' ,'Stec', 'Vtec', 'S4'])
#df.set_index('Jdatet')
#Appending Each data frame to the Giant Array np-array_values then Stacking them
np_array_values.append (df)
merge_values = np.vstack(np_array_values)
#CONVERTING ARRAY BACK TO DATA FRAME
TEC_data =pd.DataFrame(merge_values)
Vtec = TEC_data.loc[:,7]
jdate = TEC_data.loc[:,0]
Time = TEC_data.loc[:,1]
month_name = file[:-7]
STATION_NAME = file[:4]
df.replace('-99.000', np.nan)
pos= np.where(jdate)[0]
pos1=np.where(Vtec[pos]<-20.0)[0]
Vtec[pos1]='nan'
fulldate = []
for i in jdate:
a = jd_to_datetime(i)
fulldate.append(a)
#plt.show()
plt.plot(fulldate, Vtec)
#plt.xlim(0, 24)
#plt.xticks(np.arange(0, 26, 2))
plt.ylabel('TECU')
plt.xlabel('Date')
#plt.grid(axis='both')
plt.title("Station : " + STATION_NAME.upper())t**
Here is a sample of the data. The columns are:
nknown_station, "E:\GPS DATA\rbmc2\2018\amco\amco3630.18o"
-4.87199 294.66602 75.87480
Jdatet Time PRN Az Ele Lat Lon Stec Vtec S4
2458481.500000 -24.000000 1 198.34 23.37 -10.70 292.70 18.62 13.70 -99.000
2458481.500347 0.008333 1 198.21 23.53 -10.67 292.73 18.65 13.74 -99.000
2458481.500694 0.016667 1 198.07 23.69 -10.64 292.75 18.76 13.84 -99.000
2458481.501042 0.025000 1 197.94 23.85 -10.61 292.78 18.68 13.83 -99.000
2458481.501389 0.033333 1 197.81 24.01 -10.58 292.80 18.60 13.83 -99.000
2458481.501736 0.041667 1 197.68 24.17 -10.55 292.83 18.53 13.83 -99.000
2458481.502083 0.050000 1 197.54 24.33 -10.52 292.85 18.53 13.86 -99.000
2458481.502431 0.058333 1 197.41 24.49 -10.49 292.88 18.51 13.88 -99.000
2458481.502778 0.066667 1 197.28 24.65 -10.46 292.90 18.66 14.00 -99.000
2458481.503125 0.075000 1 197.15 24.81 -10.43 292.92 18.78 14.09 0.238
2458481.503472 0.083333 1 197.01 24.98 -10.40 292.95 18.55 14.01 -99.000
2458481.503819 0.091667 1 196.88 25.14 -10.37 292.97 18.39 13.96 -99.000
2458481.504167 0.100000 1 196.75 25.30 -10.34 292.99 18.33 13.97 -99.000
2458481.504514 0.108333 1 196.62 25.47 -10.31 293.02 18.20 13.94 -99.000
2458481.504861 0.116667 1 196.49 25.63 -10.28 293.04 17.61 13.67 -99.000
2458481.505208 0.125000 1 196.36 25.80 -10.25 293.06 16.74 13.25 -99.000
2458481.505556 0.133333 1 196.23 25.96 -10.22 293.09 16.06 12.92 -99.000
2458481.505903 0.141667 1 196.10 26.13 -10.19 293.11 15.46 12.64 -99.000
2458481.506250 0.150000 1 195.97 26.30 -10.16 293.13 14.77 12.31 -99.000
2458481.506597 0.158333 1 195.84 26.46 -10.13 293.15 14.42 12.15 0.127
It seems like your variable jd is a string instead of a float. Can you show some sample data? This would probably explain why this is happening.
Some additional points about your code:
While looping over all files, you want to create one dataframe. This can be done more easily:
dfMerged = pd.DataFrame()
for file in files:
    dfTemp = pd.read_csv(...)
    dfMerged = dfMerged.append(dfTemp, ignore_index=True)
This would lead to the additional benefit that you can use your column headings like they are supposed to be used, e.g., Vtec = dfMerged['Vtec'].
Your month_name and STATION_NAME are overwritten in each iteration of the loop. If they are different for each file, you want to save them as a list or as a separate column in the dataframe, e.g., dfTemp['Station'] = STATION_NAME. Keep in mind to do this before you merge the dataframes.
Also, the replace can be substituted by adding '-99.000' to the na_values argument in read_csv.
Selecting your data can be done with pandas. dfMerged.loc[(dfMerged['Jdate'].notna()) & (dfMerged['Vtec'] < -20), 'Vtec'] = pd.NA.
I added the list containing the datetime objects as an index to the DataFrame.
Apparently, to recognise the dates you need to use plt.plot_date instead of the usual plt.plot; the datetime index will take care of the rest.
I have a DataFrame that consists of many stacked time series. The index is (poolId, month) where both are integers, the "month" being the number of months since 2000. What's the best way to calculate one-month lagged versions of multiple variables?
Right now, I do something like:
cols_to_shift = ["bal", ...5 more columns...]
df_shift = df[cols_to_shift].groupby(level=0).transform(lambda x: x.shift(-1))
For my data, this took me a full 60 s to run. (I have 48k different pools and a total of 718k rows.)
I'm converting this from R code and the equivalent data.table call:
dt.shift <- dt[, list(bal=myshift(bal), ...), by=list(poolId)]
only takes 9 s to run. (Here "myshift" is something like "function(x) c(x[-1], NA)".)
Is there a way I can get the pandas version to be back in line speed-wise? I tested this on 0.8.1.
Edit: Here's an example of generating a close-enough data set, so you can get some idea of what I mean:
ids = np.arange(48000)
lens = np.maximum(np.round(15+9.5*np.random.randn(48000)), 1.0).astype(int)
id_vec = np.repeat(ids, lens)
lens_shift = np.concatenate(([0], lens[:-1]))
mon_vec = np.arange(lens.sum()) - np.repeat(np.cumsum(lens_shift), lens)
n = len(mon_vec)
df = pd.DataFrame.from_items([('pool', id_vec), ('month', mon_vec)] + [(c, np.random.rand(n)) for c in 'abcde'])
df = df.set_index(['pool', 'month'])
%time df_shift = df.groupby(level=0).transform(lambda x: x.shift(-1))
That took 64 s when I tried it. This data has every series starting at month 0; really, they should all end at month np.max(lens), with ragged start dates, but good enough.
Edit 2: Here's some comparison R code. This takes 0.8 s. Factor of 80, not good.
library(data.table)
ids <- 1:48000
lens <- as.integer(pmax(1, round(rnorm(ids, mean=15, sd=9.5))))
id.vec <- rep(ids, times=lens)
lens.shift <- c(0, lens[-length(lens)])
mon.vec <- (1:sum(lens)) - rep(cumsum(lens.shift), times=lens)
n <- length(id.vec)
dt <- data.table(pool=id.vec, month=mon.vec, a=rnorm(n), b=rnorm(n), c=rnorm(n), d=rnorm(n), e=rnorm(n))
setkey(dt, pool, month)
myshift <- function(x) c(x[-1], NA)
system.time(dt.shift <- dt[, list(month=month, a=myshift(a), b=myshift(b), c=myshift(c), d=myshift(d), e=myshift(e)), by=pool])
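(Side note: current data.table releases also ship a built-in shift(), so the hand-rolled myshift above could be written as, for example:
dt.shift2 <- dt[, shift(.SD, 1, type = "lead"), by = pool, .SDcols = c("a", "b", "c", "d", "e")]
where type = "lead" drops the first value of each group and pads with NA at the end, just like myshift.)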
I would suggest you reshape the data and do a single shift versus the groupby approach:
result = df.unstack(0).shift(1).stack()
This switches the order of the levels so you'd want to swap and reorder:
result = result.swaplevel(0, 1).sortlevel(0)
You can verify it's been lagged by one period (you want shift(1) instead of shift(-1)):
In [17]: result.ix[1]
Out[17]:
a b c d e
month
1 0.752511 0.600825 0.328796 0.852869 0.306379
2 0.251120 0.871167 0.977606 0.509303 0.809407
3 0.198327 0.587066 0.778885 0.565666 0.172045
4 0.298184 0.853896 0.164485 0.169562 0.923817
5 0.703668 0.852304 0.030534 0.415467 0.663602
6 0.851866 0.629567 0.918303 0.205008 0.970033
7 0.758121 0.066677 0.433014 0.005454 0.338596
8 0.561382 0.968078 0.586736 0.817569 0.842106
9 0.246986 0.829720 0.522371 0.854840 0.887886
10 0.709550 0.591733 0.919168 0.568988 0.849380
11 0.997787 0.084709 0.664845 0.808106 0.872628
12 0.008661 0.449826 0.841896 0.307360 0.092581
13 0.727409 0.791167 0.518371 0.691875 0.095718
14 0.928342 0.247725 0.754204 0.468484 0.663773
15 0.934902 0.692837 0.367644 0.061359 0.381885
16 0.828492 0.026166 0.050765 0.524551 0.296122
17 0.589907 0.775721 0.061765 0.033213 0.793401
18 0.532189 0.678184 0.747391 0.199283 0.349949
In [18]: df.ix[1]
Out[18]:
a b c d e
month
0 0.752511 0.600825 0.328796 0.852869 0.306379
1 0.251120 0.871167 0.977606 0.509303 0.809407
2 0.198327 0.587066 0.778885 0.565666 0.172045
3 0.298184 0.853896 0.164485 0.169562 0.923817
4 0.703668 0.852304 0.030534 0.415467 0.663602
5 0.851866 0.629567 0.918303 0.205008 0.970033
6 0.758121 0.066677 0.433014 0.005454 0.338596
7 0.561382 0.968078 0.586736 0.817569 0.842106
8 0.246986 0.829720 0.522371 0.854840 0.887886
9 0.709550 0.591733 0.919168 0.568988 0.849380
10 0.997787 0.084709 0.664845 0.808106 0.872628
11 0.008661 0.449826 0.841896 0.307360 0.092581
12 0.727409 0.791167 0.518371 0.691875 0.095718
13 0.928342 0.247725 0.754204 0.468484 0.663773
14 0.934902 0.692837 0.367644 0.061359 0.381885
15 0.828492 0.026166 0.050765 0.524551 0.296122
16 0.589907 0.775721 0.061765 0.033213 0.793401
17 0.532189 0.678184 0.747391 0.199283 0.349949
Perf isn't too bad with this method (it might be a touch slower in 0.9.0):
In [19]: %time result = df.unstack(0).shift(1).stack()
CPU times: user 1.46 s, sys: 0.24 s, total: 1.70 s
Wall time: 1.71 s