Changing digits in numbers based on a condition - pandas

In Norway we have something called D- and S-numbers. These are national identification numbers where the day or month of birth is modified.
D-number: [d+4]dmmyy
S-number: dd[m+5]myy
I have a column with dates, some of them normal (ddmmyy) and some of them are formatted as D- or S-numbers. Leading zeroes are also missing.
df = pd.DataFrame({'dates': [241290,  # 24.12.90
                             710586,  # 31.05.86
                             105299,  # 10.02.99
                             56187]   # 05.11.87
                   })
dates
0 241290
1 710586
2 105299
3 56187
I've written this function to add a leading zero and convert the dates, but this solution doesn't feel that great.
def func(s):
    s = s.astype(str)
    res = []
    for index, value in s.items():
        # Make sure all dates have 6 digits (add leading zero)
        if len(value) == 5:
            value = '0' + value
        # Convert S- and D-dates to regular dates
        if int(value[0]) > 3:
            # subtract 4 from the first digit
            res.append(str(int(value[0]) - 4) + value[1:])
        elif int(value[2]) > 1:
            # subtract 5 from the third digit
            res.append(value[:2] + str(int(value[2]) - 5) + value[3:])
        else:
            res.append(value)
    return pd.Series(res)
Is there a smoother and faster way of accomplishing the same result?

Normalize the dates by padding with 0, then explode into 3 two-digit columns (day, month, year). Apply your rules and combine the columns with pd.to_datetime:
# Suggested by @HenryEcker
# Changed: .pad(6, fillchar='0') to .zfill(6)
dates = df['dates'].astype(str).str.zfill(6).str.findall(r'(\d{2})') \
                   .apply(pd.Series).astype(int) \
                   .rename(columns={0: 'day', 1: 'month', 2: 'year'}) \
                   .agg({'day': lambda d: d if d <= 31 else d - 40,
                         'month': lambda m: m if m <= 12 else m - 50,
                         'year': lambda y: 1900 + y})
df['dates2'] = pd.to_datetime(dates)
Output:
>>> df
dates dates2
0 241290 1990-12-24
1 710586 1986-05-31
2 105299 1999-02-10
3 56187 1987-11-05
>>> dates
day month year
0 24 12 1990
1 31 5 1986
2 10 2 1999
3 5 11 1987
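If the agg step with per-column lambdas feels fragile, the same rules can also be written with plain vectorized slicing and Series.where (a sketch, not part of the original answer; it assumes pandas is imported as pd):
s = df['dates'].astype(str).str.zfill(6)
day = s.str[:2].astype(int)
month = s.str[2:4].astype(int)
year = s.str[4:].astype(int) + 1900
day = day.where(day <= 31, day - 40)          # undo the D-number offset
month = month.where(month <= 12, month - 50)  # undo the S-number offset
df['dates2'] = pd.to_datetime(dict(year=year, month=month, day=day))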

You can keep the Series as integers until the final step. The disadvantage of the method below is that the offsets do not match the ones in the specification directly, so it may take more mental effort to comprehend:
import numpy as np

def func2(s):
    # In mathematical operations, digits are counted from the right,
    # so the "first digit" becomes the sixth and the "third digit"
    # becomes the fourth in a 6-digit number
    delta = np.select(
        [s // 10**5 % 10 > 3, s // 10**3 % 10 > 1],
        [4 * 10**5, 5 * 10**3],
        0
    )
    return (s - delta).astype('str').str.pad(6, fillchar='0')
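As a quick sanity check (a sketch; df is the sample frame from the question), func2 reproduces the expected strings:
func2(df['dates']).tolist()
# ['241290', '310586', '100299', '051187']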

Related

Converting SQL-Interval-String to minutes using R

I am working with R, where I have a variable '2 month 3 day 6 hour 70 minute' as a string. The variable changes over time and therefore does not always have the same length/structure. I need this variable to do a query on a PostgreSQL database by casting it to an interval. This works just fine.
Now I need this interval/string variable as an integer number of minutes to do some mathematical calculations.
I thought of using sqldf as follows:
library(sqldf)
my_interval = '2 month 3 day 6 hour 70 minute'
interval_minutes <- sqldf(paste("SELECT EXTRACT(EPOCH FROM '",my_interval,"'::INTERVAL)/60"))
interval_minutes_novar <- sqldf("SELECT EXTRACT(EPOCH FROM '2 month 3 day 6 hour 70 minute'::INTERVAL)/60")
but I am getting Error: near "FROM": syntax error. From my research I know that sqldf uses SQLite, which does not support EXTRACT().
How can I convert a SQL-Interval to minutes using R?
1) sqldf/gsubfn Using gsubfn, replace each unit word in my_interval with *, the appropriate number of minutes, and +. Remove any trailing + and spaces, and then either parse and evaluate mins or substitute mins into the SQL statement. There are 365.25 / 12 days in the average month (averaged over 4 calendar years containing one leap year), but if you want to get the same answer as PostgreSQL, replace 365.25 / 12 with 30, as noted in the comments.
library(sqldf) # this also pulls in gsubfn
# input
my_interval = '2 month 3 day 6 hour 70 minute'
L <- list(minute = " +", hour = "*60 +", day = "*60*24 +",
          month = "*365.25 * 60 * 24 /12 +")
mins <- my_interval |>
  gsubfn(pattern = "\\w+", replacement = L) |>
  trimws(whitespace = "[+ ]")
eval(parse(text = mins))
## [1] 92410
fn$sqldf("select $mins mins")
## mins
## 1 92410
2) Base R This is a base R solution: extract the numbers and words into separate vectors, translate the words to the appropriate factors, and take their inner product. The discussion about 30-day months in (1) applies here too.
v <- c(minute = 1, hour = 60, day = 60 * 24, month = 365.25 * 60 * 24 /12)
nums <- my_interval |>
  gsub(pattern = "[a-z]", replacement = "") |>
  textConnection() |>
  scan(quiet = TRUE)
words <- my_interval |>
  gsub(pattern = "\\d", replacement = "") |>
  textConnection() |>
  scan(what = "", quiet = TRUE)
sum(v[words] * nums)
## [1] 92410
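The same extract-and-inner-product idea is just as compact in Python, in case you need it outside R (a sketch, not part of the original answers; it uses the same 365.25/12-day month as (1)):
import re

v = {'minute': 1, 'hour': 60, 'day': 60 * 24, 'month': 365.25 * 60 * 24 / 12}
my_interval = '2 month 3 day 6 hour 70 minute'
sum(float(n) * v[unit] for n, unit in re.findall(r'(\d+)\s+(\w+)', my_interval))
## 92410.0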
3) lubridate lubridate duration objects can be used.
library(lubridate)
as.numeric(duration(my_interval), "minute")
## [1] 92410
Although lubridate does not handle 30-day months (and Hadley says support is not planned), we can preprocess my_interval to get the effect.
library(gsubfn)
library(lubridate)
my_interval |>
  gsubfn(pattern = "(\\d+) +month", replacement = ~paste(30*as.numeric(x), "day")) |>
  duration() |>
  as.numeric("minute")
## [1] 91150
Adapting my answer here to this question, I'll restate a rather gaping problem with this conversion: converting "month" into "seconds" is not a constant operation, as months vary between 28 and 31 days. If we assume 30, though, for the sake of argument, then:
func <- function(x, ptn) {
  out <- gsub(paste0(".*?\\b([0-9.]+)\\s*", ptn, ".*"), "\\1", x, ignore.case = TRUE)
  ifelse(out == x, NA, out)
}
res1 <- lapply(c(mon = "month", day = "day", hr = "hour", min = "minute"),
               function(ptn) as.numeric(func(my_interval, ptn)))
res2 <- lapply(res1, function(z) ifelse(is.na(z), 0, z))
res2
res2
# $mon
# [1] 2
# $day
# [1] 3
# $hr
# [1] 6
# $min
# [1] 70
86400 * (res2$mon*30 + res2$day) + 3600*res2$hr + 60*res2$min
# [1] 5469000
Because I'm using lapply and simple vectorizable operations here, this also works if my_interval is more than one string (of similar format). It is robust to missing variables (presumed 0), and can include "year" (albeit with leap-year inaccuracies) and/or "second" if desired.
intervals <- c("2 month 3 day 6 hour 70 minute", "1 year", "1 hour 1 second")
res1 <- lapply(c(yr = "year", mon = "month", day = "day", hr = "hour", min = "minute", sec = "second"),
               function(ptn) as.numeric(func(intervals, ptn)))
res2 <- lapply(res1, function(z) ifelse(is.na(z), 0, z))
str(res2)
# List of 6
# $ yr : num [1:3] 0 1 0
# $ mon: num [1:3] 2 0 0
# $ day: num [1:3] 3 0 0
# $ hr : num [1:3] 6 0 1
# $ min: num [1:3] 70 0 0
# $ sec: num [1:3] 0 0 1
86400 * (res2$yr*365 + res2$mon*30 + res2$day) + 3600*res2$hr + 60*res2$min + res2$sec
# [1]  5469000 31536000     3601
My workaround is to use my PostgreSQL connection to do it:
library(sf)
library(RPostgres)
my_postgresql_connection <- dbConnect(Postgres(), dbname = "my_db", host = "my_host", port = 1234, user = "my_user", password = "my_password")
my_interval = '2 month 3 day 6 hour 70 minute'
my_dataframe <- st_read(my_postgresql_connection, query = paste("SELECT EXTRACT(EPOCH FROM '",my_interval,"'::INTERVAL)/60 as minutes"))
my_interval_in_minutes <- as.double(my_dataframe$minutes[1])

How can I merge two data frames on a range of dates? [duplicate]

Consider the following data.tables. The first defines a set of regions with start and end positions for each group 'x':
library(data.table)
d1 <- data.table(x = letters[1:5], start = c(1,5,19,30, 7), end = c(3,11,22,39,25))
setkey(d1, x, start)
# x start end
# 1: a 1 3
# 2: b 5 11
# 3: c 19 22
# 4: d 30 39
# 5: e 7 25
The second data set has the same grouping variable 'x', and positions 'pos' within each group:
d2 <- data.table(x = letters[c(1,1,2,2,3:5)], pos = c(2,3,3,12,20,52,10))
setkey(d2, x, pos)
# x pos
# 1: a 2
# 2: a 3
# 3: b 3
# 4: b 12
# 5: c 20
# 6: d 52
# 7: e 10
Ultimately I'd like to extract the rows in 'd2' where 'pos' falls within the range defined by 'start' and 'end', within each group x. The desired result is
# x pos start end
# 1: a 2 1 3
# 2: a 3 1 3
# 3: c 20 19 22
# 4: e 10 7 25
The start/end positions for any group x will never overlap but there may be gaps of values not in any region.
Now, I believe I should be using a rolling join. From what I can tell, I cannot use the "end" column in the join.
I've tried
d1[d2, roll = TRUE, nomatch = 0, mult = "all"][start <= end]
and got
# x start end
# 1: a 2 3
# 2: a 3 3
# 3: c 20 22
# 4: e 10 25
which is the right set of rows I want; however, "pos" has become "start" and the original "start" has been lost. Is there a way to preserve all the columns with the rolling join so I could report "start", "pos", and "end" as desired?
Overlap joins were implemented with commit 1375 in data.table v1.9.3, and are available in the current stable release, v1.9.4. The function is called foverlaps. From NEWS:
29) Overlap joins #528 is now here, finally!! Except for type="equal" and maxgap and minoverlap arguments, everything else is implemented. Check out ?foverlaps and the examples there on its usage. This is a major feature addition to data.table.
Let's consider x, an interval defined as [a, b], where a <= b, and y, another interval defined as [c, d], where c <= d. The interval y is said to overlap x at all iff d >= a and c <= b. And y is entirely contained within x iff a <= c and d <= b. For the different types of overlaps implemented, please have a look at ?foverlaps.
Your question is a special case of an overlap join: in d1 you have true physical intervals with start and end positions. In d2 on the other hand, there are only positions (pos), not intervals. To be able to do an overlap join, we need to create intervals also in d2. This is achieved by creating an additional variable pos2, which is identical to pos (d2[, pos2 := pos]). Thus, we now have an interval in d2, albeit with identical start and end coordinates. This 'virtual, zero-width interval' in d2 can then be used in foverlaps to do an overlap join with d1:
require(data.table) ## 1.9.3
setkey(d1)
d2[, pos2 := pos]
foverlaps(d2, d1, by.x = names(d2), type = "within", mult = "all", nomatch = 0L)
# x start end pos pos2
# 1: a 1 3 2 2
# 2: a 1 3 3 3
# 3: c 19 22 20 20
# 4: e 7 25 10 10
by.y by default is key(y), so we skipped it. by.x by default takes key(x) if it exists, and if not takes key(y). But a key doesn't exist for d2, and we can't set the columns from y, because they don't have the same names. So, we set by.x explicitly.
The type of overlap is within, and we'd like to have all matches, but only where there is a match (hence nomatch = 0L).
NB: foverlaps uses data.table's binary search feature (along with roll where necessary) under the hood, but some function arguments (types of overlaps, maxgap, minoverlap etc..) are inspired by the function findOverlaps() from the Bioconductor package IRanges, an excellent package (and so is GenomicRanges, which extends IRanges for Genomics).
So what's the advantage?
Benchmarking the code above on your data shows foverlaps() slower than Gabor's answer (timings: Gabor's data.table solution = 0.004 vs foverlaps = 0.021 seconds). But does it really matter at this granularity?
What would be really interesting is to see how well it scales - in terms of both speed and memory. In Gabor's answer, we join based on the key column x. And then filter the results.
What if d1 has about 40K rows and d2 has 100K rows (or more)? For each row in d2 that matches x in d1, all those rows will be matched and returned, only to be filtered later. Here's an example of your Q scaled only slightly:
Generate data:
require(data.table)
set.seed(1L)
n = 20e3L; k = 100e3L
idx1 = sample(100, n, TRUE)
idx2 = sample(100, n, TRUE)
d1 = data.table(x = sample(letters[1:5], n, TRUE),
                start = pmin(idx1, idx2),
                end = pmax(idx1, idx2))
d2 = data.table(x = sample(letters[1:15], k, TRUE),
                pos1 = sample(60:150, k, TRUE))
foverlaps:
system.time({
  setkey(d1)
  d2[, pos2 := pos1]
  ans1 = foverlaps(d2, d1, by.x=1:3, type="within", nomatch=0L)
})
# user system elapsed
# 3.028 0.635 3.745
This took ~ 1GB of memory in total, out of which ans1 is 420MB. Most of the time spent here is on subset really. You can check it by setting the argument verbose=TRUE.
Gabor's solutions:
## new session - data.table solution
system.time({
  setkey(d1, x)
  ans2 <- d1[d2, allow.cartesian=TRUE, nomatch=0L][between(pos1, start, end)]
})
# user system elapsed
# 15.714 4.424 20.324
And this took a total of ~3.5GB.
I just noted that Gabor already mentions the memory required for intermediate results. So, trying out sqldf:
# new session - sqldf solution
system.time(ans3 <- sqldf("select * from d1 join
                           d2 using (x) where pos1 between start and end"))
# user system elapsed
# 73.955 1.605 77.049
Took a total of ~1.4GB. So, it definitely uses less memory than the one shown above.
[The answers were verified to be identical after removing pos2 from ans1 and setting key on both answers.]
Note that this overlap join is designed with problems where d2 doesn't necessarily have identical start and end coordinates (ex: genomics, the field where I come from, where d2 is usually about 30-150 million or more rows).
foverlaps() is stable, but is still under development, meaning some arguments and names might get changed.
NB: Since I mentioned GenomicRanges above, it is also perfectly capable of solving this problem. It uses interval trees under the hood, and is quite memory efficient as well. In my benchmarks on genomics data, foverlaps() is faster. But that's for another (blog) post, some other time.
data.table v1.9.8+ has a new feature - non-equi joins. With that, this operation becomes even more straightforward:
require(data.table) #v1.9.8+
# no need to set keys on `d1` or `d2`
d2[d1, .(x, pos=x.pos, start, end), on=.(x, pos>=start, pos<=end), nomatch=0L]
# x pos start end
# 1: a 2 1 3
# 2: a 3 1 3
# 3: c 20 19 22
# 4: e 10 7 25
1) sqldf This is not data.table, but complex join criteria are easy to specify in a straightforward manner in SQL:
library(sqldf)
sqldf("select * from d1 join d2 using (x) where pos between start and end")
giving:
x start end pos
1 a 1 3 2
2 a 1 3 3
3 c 19 22 20
4 e 7 25 10
2) data.table For a data.table answer try this:
library(data.table)
setkey(d1, x)
setkey(d2, x)
d1[d2][between(pos, start, end)]
giving:
x start end pos
1: a 1 3 2
2: a 1 3 3
3: c 19 22 20
4: e 7 25 10
Note that this does have the disadvantage of forming the possibly large intermediate result d1[d2], which SQL may not do. The remaining solutions may have this problem too.
3) dplyr This suggests the corresponding dplyr solution. We also use between from data.table:
library(dplyr)
library(data.table) # between
d1 %>%
  inner_join(d2) %>%
  filter(between(pos, start, end))
giving:
Joining by: "x"
x start end pos
1 a 1 3 2
2 a 1 3 3
3 c 19 22 20
4 e 7 25 10
4) merge/subset Using only the base of R:
subset(merge(d1, d2), start <= pos & pos <= end)
giving:
x start end pos
1: a 1 3 2
2: a 1 3 3
3: c 19 22 20
4: e 7 25 10
Added: Note that the data.table solution here is much faster than the one in the other answer:
dt1 <- function() {
  d1 <- data.table(x=letters[1:5], start=c(1,5,19,30,7), end=c(3,11,22,39,25))
  d2 <- data.table(x=letters[c(1,1,2,2,3:5)], pos=c(2,3,3,12,20,52,10))
  setkey(d1, x, start)
  idx1 = d1[d2, which=TRUE, roll=Inf]   # last observation carried forwards
  setkey(d1, x, end)
  idx2 = d1[d2, which=TRUE, roll=-Inf]  # next observation carried backwards
  idx = which(!is.na(idx1) & !is.na(idx2))
  ans1 <<- cbind(d1[idx1[idx]], d2[idx, list(pos)])
}
dt2 <- function() {
  d1 <- data.table(x=letters[1:5], start=c(1,5,19,30,7), end=c(3,11,22,39,25))
  d2 <- data.table(x=letters[c(1,1,2,2,3:5)], pos=c(2,3,3,12,20,52,10))
  setkey(d1, x)
  ans2 <<- d1[d2][between(pos, start, end)]
}
all.equal(as.data.frame(ans1), as.data.frame(ans2))
## TRUE
library(rbenchmark)
benchmark(dt1(), dt2())[1:4]
##    test replications elapsed relative
## 1 dt1()          100    1.45    1.667
## 2 dt2()          100    0.87    1.000  <-- from (2) above
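For readers working in pandas rather than R, the join-then-filter idea from (2) and (4) translates directly (a sketch, not part of the original answers):
import pandas as pd

d1 = pd.DataFrame({'x': list('abcde'),
                   'start': [1, 5, 19, 30, 7],
                   'end': [3, 11, 22, 39, 25]})
d2 = pd.DataFrame({'x': list('aabbcde'), 'pos': [2, 3, 3, 12, 20, 52, 10]})

# inner join on the group column, then keep rows where pos falls in [start, end]
out = d2.merge(d1, on='x')
out = out[out['pos'].between(out['start'], out['end'])]
Like d1[d2], this materializes the full join before filtering, so the memory caveat discussed in the foverlaps answer applies here too.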
Overlap joins are available in dplyr 1.1.0 via the function join_by.
With join_by, you can do an overlap join with between, or manually with >= and <=:
library(dplyr)
inner_join(d2, d1, by = join_by(x, between(pos, start, end)))
# x pos start end
#1 a 2 1 3
#2 a 3 1 3
#3 c 20 19 22
#4 e 10 7 25
inner_join(d2, d1, by = join_by(x, pos >= start, pos <= end))
# x pos start end
#1 a 2 1 3
#2 a 3 1 3
#3 c 20 19 22
#4 e 10 7 25
Using fuzzyjoin:
result <- fuzzyjoin::fuzzy_inner_join(d2, d1,
                                      by = c('x', 'pos' = 'start', 'pos' = 'end'),
                                      match_fun = list(`==`, `>=`, `<=`))
result
result
# x.x pos x.y start end
# <chr> <dbl> <chr> <dbl> <dbl>
#1 a 2 a 1 3
#2 a 3 a 1 3
#3 c 20 c 19 22
#4 e 10 e 7 25
Since fuzzyjoin returns all the columns, we might need to do some cleaning to keep only the columns that we want.
library(dplyr)
result %>% select(x = x.x, pos, start, end)
# A tibble: 4 x 4
# x pos start end
# <chr> <dbl> <dbl> <dbl>
#1 a 2 1 3
#2 a 3 1 3
#3 c 20 19 22
#4 e 10 7 25

How do I select data for a period of time, say a month, for plotting?

I want to be able to visualise and analyse data for a specific period of time. The data is in different files, so I read each file and appended them into an array. The data I have is in Julian date format, so I used a function to convert it to datetime so that I am able to plot the data. However, the function is now giving an error:
File "D:\DATA\TEC DATA\2018\amco\RES\CMN_append_df_stack_merge.py", line 162, in jd_to_date
jd = jd + 0.5
TypeError: can only concatenate str (not "float") to str
It was not giving this error before; however, the plot had lines crisscrossing everywhere, which I suspect is because the datetimes were not formatted correctly.
Here is the code I used:
# -*- coding: utf-8 -*-
"""
Created on Mon Mar 9 17:51:41 2020
#author: user
"""
import glob as glob
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from datetime import datetime
"""
Functions for converting dates to/from JD and MJD. Assumes dates are historical
dates, including the transition from the Julian calendar to the Gregorian
calendar in 1582. No support for proleptic Gregorian/Julian calendars.
:Author: Matt Davis
:Website: http://github.com/jiffyclub
"""
import math
import datetime as dt
# Note: The Python datetime module assumes an infinitely valid Gregorian calendar.
# The Gregorian calendar took effect after 10-15-1582 and the dates 10-05 through
# 10-14-1582 never occurred. Python datetime objects will produce incorrect
# time deltas if one date is from before 10-15-1582.
def mjd_to_jd(mjd):
    """
    Convert Modified Julian Day to Julian Day.

    Parameters
    ----------
    mjd : float
        Modified Julian Day

    Returns
    -------
    jd : float
        Julian Day
    """
    return mjd + 2400000.5
def jd_to_mjd(jd):
    """
    Convert Julian Day to Modified Julian Day.

    Parameters
    ----------
    jd : float
        Julian Day

    Returns
    -------
    mjd : float
        Modified Julian Day
    """
    return jd - 2400000.5
def date_to_jd(year, month, day):
    """
    Convert a date to Julian Day.

    Algorithm from 'Practical Astronomy with your Calculator or Spreadsheet',
    4th ed., Duffet-Smith and Zwart, 2011.

    Parameters
    ----------
    year : int
        Year as integer. Years preceding 1 A.D. should be 0 or negative.
        The year before 1 A.D. is 0, 10 B.C. is year -9.
    month : int
        Month as integer, Jan = 1, Feb. = 2, etc.
    day : float
        Day, may contain fractional part.

    Returns
    -------
    jd : float
        Julian Day

    Examples
    --------
    Convert 6 a.m., February 17, 1985 to Julian Day

    >>> date_to_jd(1985, 2, 17.25)
    2446113.75
    """
    if month == 1 or month == 2:
        yearp = year - 1
        monthp = month + 12
    else:
        yearp = year
        monthp = month

    # this checks where we are in relation to October 15, 1582, the beginning
    # of the Gregorian calendar.
    if ((year < 1582) or
            (year == 1582 and month < 10) or
            (year == 1582 and month == 10 and day < 15)):
        # before start of Gregorian calendar
        B = 0
    else:
        # after start of Gregorian calendar
        A = math.trunc(yearp / 100.)
        B = 2 - A + math.trunc(A / 4.)

    if yearp < 0:
        C = math.trunc((365.25 * yearp) - 0.75)
    else:
        C = math.trunc(365.25 * yearp)

    D = math.trunc(30.6001 * (monthp + 1))

    jd = B + C + D + day + 1720994.5

    return jd
def jd_to_date(jd):
    """
    Convert Julian Day to date.

    Algorithm from 'Practical Astronomy with your Calculator or Spreadsheet',
    4th ed., Duffet-Smith and Zwart, 2011.

    Parameters
    ----------
    jd : float
        Julian Day

    Returns
    -------
    year : int
        Year as integer. Years preceding 1 A.D. should be 0 or negative.
        The year before 1 A.D. is 0, 10 B.C. is year -9.
    month : int
        Month as integer, Jan = 1, Feb. = 2, etc.
    day : float
        Day, may contain fractional part.

    Examples
    --------
    Convert Julian Day 2446113.75 to year, month, and day.

    >>> jd_to_date(2446113.75)
    (1985, 2, 17.25)
    """
    jd = jd + 0.5
    F, I = math.modf(jd)
    I = int(I)

    A = math.trunc((I - 1867216.25) / 36524.25)

    if I > 2299160:
        B = I + 1 + A - math.trunc(A / 4.)
    else:
        B = I

    C = B + 1524
    D = math.trunc((C - 122.1) / 365.25)
    E = math.trunc(365.25 * D)
    G = math.trunc((C - E) / 30.6001)

    day = C - E + F - math.trunc(30.6001 * G)

    if G < 13.5:
        month = G - 1
    else:
        month = G - 13

    if month > 2.5:
        year = D - 4716
    else:
        year = D - 4715

    return year, month, day
def hmsm_to_days(hour=0, min=0, sec=0, micro=0):
    """
    Convert hours, minutes, seconds, and microseconds to fractional days.

    Parameters
    ----------
    hour : int, optional
        Hour number. Defaults to 0.
    min : int, optional
        Minute number. Defaults to 0.
    sec : int, optional
        Second number. Defaults to 0.
    micro : int, optional
        Microsecond number. Defaults to 0.

    Returns
    -------
    days : float
        Fractional days.

    Examples
    --------
    >>> hmsm_to_days(hour=6)
    0.25
    """
    days = sec + (micro / 1.e6)
    days = min + (days / 60.)
    days = hour + (days / 60.)
    return days / 24.
def days_to_hmsm(days):
    """
    Convert fractional days to hours, minutes, seconds, and microseconds.
    Precision beyond microseconds is rounded to the nearest microsecond.

    Parameters
    ----------
    days : float
        A fractional number of days. Must be less than 1.

    Returns
    -------
    hour : int
        Hour number.
    min : int
        Minute number.
    sec : int
        Second number.
    micro : int
        Microsecond number.

    Raises
    ------
    ValueError
        If `days` is >= 1.

    Examples
    --------
    >>> days_to_hmsm(0.1)
    (2, 24, 0, 0)
    """
    hours = days * 24.
    hours, hour = math.modf(hours)
    mins = hours * 60.
    mins, min = math.modf(mins)
    secs = mins * 60.
    secs, sec = math.modf(secs)
    micro = round(secs * 1.e6)
    return int(hour), int(min), int(sec), int(micro)
def datetime_to_jd(date):
    """
    Convert a `datetime.datetime` object to Julian Day.

    Parameters
    ----------
    date : `datetime.datetime` instance

    Returns
    -------
    jd : float
        Julian day.

    Examples
    --------
    >>> d = datetime.datetime(1985, 2, 17, 6)
    >>> d
    datetime.datetime(1985, 2, 17, 6, 0)
    >>> jdutil.datetime_to_jd(d)
    2446113.75
    """
    days = date.day + hmsm_to_days(date.hour, date.minute, date.second, date.microsecond)
    return date_to_jd(date.year, date.month, days)
def jd_to_datetime(jd):
    """
    Convert a Julian Day to a `datetime.datetime` object.

    Parameters
    ----------
    jd : float
        Julian day.

    Returns
    -------
    dt : `datetime.datetime` object
        `datetime.datetime` equivalent of the Julian day.

    Examples
    --------
    >>> jd_to_datetime(2446113.75)
    datetime(1985, 2, 17, 6, 0)
    """
    year, month, day = jd_to_date(jd)
    frac_days, day = math.modf(day)
    day = int(day)
    hour, min, sec, micro = days_to_hmsm(frac_days)
    return datetime(year, month, day, hour, min, sec, micro)
'---------------------------------------------------------------------------'
# JULIAN DAY CONVERTER
def timedelta_to_days(td):
    """
    Convert a `datetime.timedelta` object to a total number of days.

    Parameters
    ----------
    td : `datetime.timedelta` instance

    Returns
    -------
    days : float
        Total number of days in the `datetime.timedelta` object.

    Examples
    --------
    >>> td = datetime.timedelta(4.5)
    >>> td
    datetime.timedelta(4, 43200)
    >>> timedelta_to_days(td)
    4.5
    """
    seconds_in_day = 24. * 3600.
    # microseconds contribute micro / 1e6 seconds
    days = td.days + (td.seconds + (td.microseconds / 1.e6)) / seconds_in_day
    return days
"""
'Start of main Code---------------------------------------------------------------------------------------------------'
"""
#########
files = glob.glob('*.Cmn')
files = files[:3]
np_array_values = []
for file in files:
    df = pd.read_csv(file, sep='\t', skiprows=5,
                     names=["Jdate", 'Time', 'PRN', 'Az', 'Ele', 'Lat',
                            'Lon', 'Stec', 'Vtec', 'S4'])
    #df.set_index('Jdatet')
    # Appending each data frame to the giant array np_array_values, then stacking them
    np_array_values.append(df)
merge_values = np.vstack(np_array_values)
# CONVERTING ARRAY BACK TO DATA FRAME
TEC_data = pd.DataFrame(merge_values)
Vtec = TEC_data.loc[:, 7]
jdate = TEC_data.loc[:, 0]
Time = TEC_data.loc[:, 1]
month_name = file[:-7]
STATION_NAME = file[:4]
df.replace('-99.000', np.nan)
pos = np.where(jdate)[0]
pos1 = np.where(Vtec[pos] < -20.0)[0]
Vtec[pos1] = 'nan'
fulldate = []
for i in jdate:
    a = jd_to_datetime(i)
    fulldate.append(a)
#plt.show()
plt.plot(fulldate, Vtec)
#plt.xlim(0, 24)
#plt.xticks(np.arange(0, 26, 2))
plt.ylabel('TECU')
plt.xlabel('Date')
#plt.grid(axis='both')
plt.title("Station : " + STATION_NAME.upper())
Here is a sample of the data. The columns are:
nknown_station, "E:\GPS DATA\rbmc2\2018\amco\amco3630.18o"
-4.87199 294.66602 75.87480
Jdatet Time PRN Az Ele Lat Lon Stec Vtec S4
2458481.500000 -24.000000 1 198.34 23.37 -10.70 292.70 18.62 13.70 -99.000
2458481.500347 0.008333 1 198.21 23.53 -10.67 292.73 18.65 13.74 -99.000
2458481.500694 0.016667 1 198.07 23.69 -10.64 292.75 18.76 13.84 -99.000
2458481.501042 0.025000 1 197.94 23.85 -10.61 292.78 18.68 13.83 -99.000
2458481.501389 0.033333 1 197.81 24.01 -10.58 292.80 18.60 13.83 -99.000
2458481.501736 0.041667 1 197.68 24.17 -10.55 292.83 18.53 13.83 -99.000
2458481.502083 0.050000 1 197.54 24.33 -10.52 292.85 18.53 13.86 -99.000
2458481.502431 0.058333 1 197.41 24.49 -10.49 292.88 18.51 13.88 -99.000
2458481.502778 0.066667 1 197.28 24.65 -10.46 292.90 18.66 14.00 -99.000
2458481.503125 0.075000 1 197.15 24.81 -10.43 292.92 18.78 14.09 0.238
2458481.503472 0.083333 1 197.01 24.98 -10.40 292.95 18.55 14.01 -99.000
2458481.503819 0.091667 1 196.88 25.14 -10.37 292.97 18.39 13.96 -99.000
2458481.504167 0.100000 1 196.75 25.30 -10.34 292.99 18.33 13.97 -99.000
2458481.504514 0.108333 1 196.62 25.47 -10.31 293.02 18.20 13.94 -99.000
2458481.504861 0.116667 1 196.49 25.63 -10.28 293.04 17.61 13.67 -99.000
2458481.505208 0.125000 1 196.36 25.80 -10.25 293.06 16.74 13.25 -99.000
2458481.505556 0.133333 1 196.23 25.96 -10.22 293.09 16.06 12.92 -99.000
2458481.505903 0.141667 1 196.10 26.13 -10.19 293.11 15.46 12.64 -99.000
2458481.506250 0.150000 1 195.97 26.30 -10.16 293.13 14.77 12.31 -99.000
2458481.506597 0.158333 1 195.84 26.46 -10.13 293.15 14.42 12.15 0.127
It seems like your variable jd is a string instead of a float. Can you show some sample data? This would probably explain why this is happening.
Some additional points about your code:
While looping over all files, you want to create one dataframe. This can be done more easily:
dfMerged = pd.DataFrame()
for file in files:
    dfTemp = pd.read_csv(...)
    dfMerged = dfMerged.append(dfTemp, ignore_index=True)
This would lead to the additional benefit that you can use your column headings like they are supposed to be used, e.g., Vtec = dfMerged['Vtec'].
Your month_name and STATION_NAME are overwritten in each iteration of the loop. If they are different for each file, you want to save them as a list or as a separate column in the dataframe, e.g., dfTemp['Station'] = STATION_NAME. Keep in mind to do this before you merge the dataframes.
Also, the replace can be substituted by adding '-99.000' to the na_values argument in read_csv.
Selecting your data can be done with pandas. dfMerged.loc[(dfMerged['Jdate'].notna()) & (dfMerged['Vtec'] < -20), 'Vtec'] = pd.NA.
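Since DataFrame.append has been removed from recent pandas versions, the loop above is now usually written by collecting the frames and concatenating once. Here is a sketch combining these suggestions (column names taken from the question; the -99.000 sentinel is handled via na_values):
import glob
import numpy as np
import pandas as pd

cols = ["Jdate", "Time", "PRN", "Az", "Ele", "Lat", "Lon", "Stec", "Vtec", "S4"]
frames = [pd.read_csv(f, sep='\t', skiprows=5, names=cols, na_values=['-99.000'])
          for f in glob.glob('*.Cmn')]
dfMerged = pd.concat(frames, ignore_index=True)

# mask implausible Vtec values instead of assigning the string 'nan'
dfMerged.loc[dfMerged['Vtec'] < -20, 'Vtec'] = np.nan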
I added the list containing the datetime objects as an index to the DataFrame.
Apparently, to recognise the dates you need to use plt.plot_date instead of the usual plt.plot; the datetime index will take care of the rest.
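A minimal sketch of that idea, reusing the fulldate and Vtec names from the question's code:
import matplotlib.pyplot as plt

plt.plot_date(fulldate, Vtec, fmt='-')
plt.ylabel('TECU')
plt.xlabel('Date')
plt.show()
On recent Matplotlib releases a plain plt.plot call handles datetime x values as well, so plot_date mainly matters on older versions.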

How to delete "1" followed by trailing zeros from Data Frame row values?

From my "Id" Column I want to remove the one and zero's from the left.
That is
1000003 becomes 3
1000005 becomes 5
1000011 becomes 11 and so on
Ignore -1, 10 and 1000000; they will be handled as special cases. But from the remaining rows I want to remove the "1" followed by zeros.
Well, you can use the modulus to get the end of the numbers (they will be the remainder). So just exclude the rows with Ids of [-1, 10, 1000000] and then take the values modulo 1000000:
print df
Id
0 -1
1 10
2 1000000
3 1000003
4 1000005
5 1000007
6 1000009
7 1000011
keep = df.Id.isin([-1,10,1000000])
df.Id[~keep] = df.Id[~keep] % 1000000
print df
Id
0 -1
1 10
2 1000000
3 3
4 5
5 7
6 9
7 11
Edit: Here is a fully vectorized string-slice version as an alternative (like Alex's method, but taking advantage of pandas' vectorized string methods):
keep = df.Id.isin([-1,10,1000000])
df.Id[~keep] = df.Id[~keep].astype(str).str[1:].astype(int)
print df
Id
0 -1
1 10
2 1000000
3 3
4 5
5 7
6 9
7 11
Here is another way you could try to do it:
def f(x):
    """Convert the value to a string, then select only the characters
    after the first one in the string, which is 1. For example,
    100005 would be 00005 and I believe it's returning 00005.0 from
    the dataframe, which is why the float() is there. Then just convert
    it to an int, and you'll have 5, etc.
    """
    return int(float(str(x)[1:]))

# apply the function "f" to the dataframe and pass in the column 'Id'
df.apply(lambda row: f(row['Id']), axis=1)
I see that this question is satisfactorily answered. But for future visitors, what I like about Alex's answer is that it does not depend on there being exactly four zeros. The accepted answer will fail if you sometimes have 10005, sometimes 1000005, and so on.
However, to add something more to the way we think about it: if you know the prefix is always going to be 1000000, you can do
# backup all values
foo = df.Id
# now, some will be negative or zero
df.Id = df.Id - 1000000
# bring back those that are negative or zero (here, the first three rows)
df.Id[df.Id <= 0] = foo[df.Id <= 0]
It gives you the same result as Karl's answer, but I typically prefer this kind of method for its readability.
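On newer pandas versions the same keep-or-strip logic can also be written without chained assignment, using Series.where (a sketch, not from the original answers):
special = df.Id.isin([-1, 10, 1000000])
df['Id'] = df.Id.where(special, df.Id % 1000000)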

Most efficient way to shift MultiIndex time series

I have a DataFrame that consists of many stacked time series. The index is (poolId, month) where both are integers, the "month" being the number of months since 2000. What's the best way to calculate one-month lagged versions of multiple variables?
Right now, I do something like:
cols_to_shift = ["bal", ...5 more columns...]
df_shift = df[cols_to_shift].groupby(level=0).transform(lambda x: x.shift(-1))
For my data, this took me a full 60 s to run. (I have 48k different pools and a total of 718k rows.)
I'm converting this from R code and the equivalent data.table call:
dt.shift <- dt[, list(bal=myshift(bal), ...), by=list(poolId)]
only takes 9 s to run. (Here "myshift" is something like "function(x) c(x[-1], NA)".)
Is there a way I can get the pandas version to be back in line speed-wise? I tested this on 0.8.1.
Edit: Here's an example of generating a close-enough data set, so you can get some idea of what I mean:
ids = np.arange(48000)
lens = np.maximum(np.round(15+9.5*np.random.randn(48000)), 1.0).astype(int)
id_vec = np.repeat(ids, lens)
lens_shift = np.concatenate(([0], lens[:-1]))
mon_vec = np.arange(lens.sum()) - np.repeat(np.cumsum(lens_shift), lens)
n = len(mon_vec)
df = pd.DataFrame.from_items([('pool', id_vec), ('month', mon_vec)] + [(c, np.random.rand(n)) for c in 'abcde'])
df = df.set_index(['pool', 'month'])
%time df_shift = df.groupby(level=0).transform(lambda x: x.shift(-1))
That took 64 s when I tried it. This data has every series starting at month 0; really, they should all end at month np.max(lens), with ragged start dates, but good enough.
Edit 2: Here's some comparison R code. This takes 0.8 s. Factor of 80, not good.
library(data.table)
ids <- 1:48000
lens <- as.integer(pmax(1, round(rnorm(ids, mean=15, sd=9.5))))
id.vec <- rep(ids, times=lens)
lens.shift <- c(0, lens[-length(lens)])
mon.vec <- (1:sum(lens)) - rep(cumsum(lens.shift), times=lens)
n <- length(id.vec)
dt <- data.table(pool=id.vec, month=mon.vec, a=rnorm(n), b=rnorm(n), c=rnorm(n), d=rnorm(n), e=rnorm(n))
setkey(dt, pool, month)
myshift <- function(x) c(x[-1], NA)
system.time(dt.shift <- dt[, list(month=month, a=myshift(a), b=myshift(b), c=myshift(c), d=myshift(d), e=myshift(e)), by=pool])
I would suggest you reshape the data and do a single shift versus the groupby approach:
result = df.unstack(0).shift(1).stack()
This switches the order of the levels so you'd want to swap and reorder:
result = result.swaplevel(0, 1).sortlevel(0)
You can verify it's been lagged by one period (you want shift(1) instead of shift(-1)):
In [17]: result.ix[1]
Out[17]:
a b c d e
month
1 0.752511 0.600825 0.328796 0.852869 0.306379
2 0.251120 0.871167 0.977606 0.509303 0.809407
3 0.198327 0.587066 0.778885 0.565666 0.172045
4 0.298184 0.853896 0.164485 0.169562 0.923817
5 0.703668 0.852304 0.030534 0.415467 0.663602
6 0.851866 0.629567 0.918303 0.205008 0.970033
7 0.758121 0.066677 0.433014 0.005454 0.338596
8 0.561382 0.968078 0.586736 0.817569 0.842106
9 0.246986 0.829720 0.522371 0.854840 0.887886
10 0.709550 0.591733 0.919168 0.568988 0.849380
11 0.997787 0.084709 0.664845 0.808106 0.872628
12 0.008661 0.449826 0.841896 0.307360 0.092581
13 0.727409 0.791167 0.518371 0.691875 0.095718
14 0.928342 0.247725 0.754204 0.468484 0.663773
15 0.934902 0.692837 0.367644 0.061359 0.381885
16 0.828492 0.026166 0.050765 0.524551 0.296122
17 0.589907 0.775721 0.061765 0.033213 0.793401
18 0.532189 0.678184 0.747391 0.199283 0.349949
In [18]: df.ix[1]
Out[18]:
a b c d e
month
0 0.752511 0.600825 0.328796 0.852869 0.306379
1 0.251120 0.871167 0.977606 0.509303 0.809407
2 0.198327 0.587066 0.778885 0.565666 0.172045
3 0.298184 0.853896 0.164485 0.169562 0.923817
4 0.703668 0.852304 0.030534 0.415467 0.663602
5 0.851866 0.629567 0.918303 0.205008 0.970033
6 0.758121 0.066677 0.433014 0.005454 0.338596
7 0.561382 0.968078 0.586736 0.817569 0.842106
8 0.246986 0.829720 0.522371 0.854840 0.887886
9 0.709550 0.591733 0.919168 0.568988 0.849380
10 0.997787 0.084709 0.664845 0.808106 0.872628
11 0.008661 0.449826 0.841896 0.307360 0.092581
12 0.727409 0.791167 0.518371 0.691875 0.095718
13 0.928342 0.247725 0.754204 0.468484 0.663773
14 0.934902 0.692837 0.367644 0.061359 0.381885
15 0.828492 0.026166 0.050765 0.524551 0.296122
16 0.589907 0.775721 0.061765 0.033213 0.793401
17 0.532189 0.678184 0.747391 0.199283 0.349949
Perf isn't too bad with this method (it might be a touch slower in 0.9.0):
In [19]: %time result = df.unstack(0).shift(1).stack()
CPU times: user 1.46 s, sys: 0.24 s, total: 1.70 s
Wall time: 1.71 s
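For reference, on current pandas versions the grouped shift no longer needs a Python-level lambda, so something like the following sketch (assuming df keeps the (pool, month) MultiIndex from the question) is usually both the simplest and the fastest spelling:
# vectorized grouped shift; no transform/lambda needed
df_shift = df.groupby(level=0).shift(-1)

# the reshape trick in modern spelling (sortlevel/ix were removed long ago)
result = df.unstack(0).shift(1).stack().swaplevel(0, 1).sort_index()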