How can I merge two data frames on a range of dates? [duplicate]

How can I merge two data frames on a range of dates? [duplicate] - sql

Consider the following data.tables. The first defines a set of regions with start and end positions for each group 'x':
library(data.table)
d1 <- data.table(x = letters[1:5], start = c(1,5,19,30, 7), end = c(3,11,22,39,25))
setkey(d1, x, start)
# x start end
# 1: a 1 3
# 2: b 5 11
# 3: c 19 22
# 4: d 30 39
# 5: e 7 25
The second data set has the same grouping variable 'x', and positions 'pos' within each group:
d2 <- data.table(x = letters[c(1,1,2,2,3:5)], pos = c(2,3,3,12,20,52,10))
setkey(d2, x, pos)
# x pos
# 1: a 2
# 2: a 3
# 3: b 3
# 4: b 12
# 5: c 20
# 6: d 52
# 7: e 10
Ultimately I'd like to extract the rows in 'd2' where 'pos' falls within the range defined by 'start' and 'end', within each group x. The desired result is
# x pos start end
# 1: a 2 1 3
# 2: a 3 1 3
# 3: c 20 19 22
# 4: e 10 7 25
The start/end positions for any group x will never overlap but there may be gaps of values not in any region.
Now, I believe I should be using a rolling join. From what i can tell, I cannot use the "end" column in the join.
I've tried
d1[d2, roll = TRUE, nomatch = 0, mult = "all"][start <= end]
and got
# x start end
# 1: a 2 3
# 2: a 3 3
# 3: c 20 22
# 4: e 10 25
which is the right set of rows I want; However "pos" has become "start" and the original "start" has been lost. Is there a way to preserve all the columns with the roll join so i could report "start", "pos", "end" as desired?

Overlap joins was implemented with commit 1375 in data.table v1.9.3, and is available in the current stable release, v1.9.4. The function is called foverlaps. From NEWS:
29) Overlap joins #528 is now here, finally!! Except for type="equal" and maxgap and minoverlap arguments, everything else is implemented. Check out ?foverlaps and the examples there on its usage. This is a major feature addition to data.table.
Let's consider x, an interval defined as [a, b], where a <= b, and y, another interval defined as [c, d], where c <= d. The interval y is said to overlap x at all, iff d >= a and c <= b 1. And y is entirely contained within x, iff a <= c,d <= b 2. For the different types of overlaps implemented, please have a look at ?foverlaps.
Your question is a special case of an overlap join: in d1 you have true physical intervals with start and end positions. In d2 on the other hand, there are only positions (pos), not intervals. To be able to do an overlap join, we need to create intervals also in d2. This is achieved by creating an additional variable pos2, which is identical to pos (d2[, pos2 := pos]). Thus, we now have an interval in d2, albeit with identical start and end coordinates. This 'virtual, zero-width interval' in d2 can then be used in foverlap to do an overlap join with d1:
require(data.table) ## 1.9.3
setkey(d1)
d2[, pos2 := pos]
foverlaps(d2, d1, by.x = names(d2), type = "within", mult = "all", nomatch = 0L)
# x start end pos pos2
# 1: a 1 3 2 2
# 2: a 1 3 3 3
# 3: c 19 22 20 20
# 4: e 7 25 10 10
by.y by default is key(y), so we skipped it. by.x by default takes key(x) if it exists, and if not takes key(y). But a key doesn't exist for d2, and we can't set the columns from y, because they don't have the same names. So, we set by.x explicitly.
The type of overlap is within, and we'd like to have all matches, only if there is a match.
NB: foverlaps uses data.table's binary search feature (along with roll where necessary) under the hood, but some function arguments (types of overlaps, maxgap, minoverlap etc..) are inspired by the function findOverlaps() from the Bioconductor package IRanges, an excellent package (and so is GenomicRanges, which extends IRanges for Genomics).
So what's the advantage?
A benchmark on the code above on your data results in foverlaps() slower than Gabor's answer (Timings: Gabor's data.table solution = 0.004 vs foverlaps = 0.021 seconds). But does it really matter at this granularity?
What would be really interesting is to see how well it scales - in terms of both speed and memory. In Gabor's answer, we join based on the key column x. And then filter the results.
What if d1 has about 40K rows and d2 has a 100K rows (or more)? For each row in d2 that matches x in d1, all those rows will be matched and returned, only to be filtered later. Here's an example of your Q scaled only slightly:
Generate data:
require(data.table)
set.seed(1L)
n = 20e3L; k = 100e3L
idx1 = sample(100, n, TRUE)
idx2 = sample(100, n, TRUE)
d1 = data.table(x = sample(letters[1:5], n, TRUE),
start = pmin(idx1, idx2),
end = pmax(idx1, idx2))
d2 = data.table(x = sample(letters[1:15], k, TRUE),
pos1 = sample(60:150, k, TRUE))
foverlaps:
system.time({
setkey(d1)
d2[, pos2 := pos1]
ans1 = foverlaps(d2, d1, by.x=1:3, type="within", nomatch=0L)
})
# user system elapsed
# 3.028 0.635 3.745
This took ~ 1GB of memory in total, out of which ans1 is 420MB. Most of the time spent here is on subset really. You can check it by setting the argument verbose=TRUE.
Gabor's solutions:
## new session - data.table solution
system.time({
setkey(d1, x)
ans2 <- d1[d2, allow.cartesian=TRUE, nomatch=0L][between(pos1, start, end)]
})
# user system elapsed
# 15.714 4.424 20.324
And this took a total of ~3.5GB.
I just noted that Gabor already mentions the memory required for intermediate results. So, trying out sqldf:
# new session - sqldf solution
system.time(ans3 <- sqldf("select * from d1 join
d2 using (x) where pos1 between start and end"))
# user system elapsed
# 73.955 1.605 77.049
Took a total of ~1.4GB. So, it definitely uses less memory than the one shown above.
[The answers were verified to be identical after removing pos2 from ans1 and setting key on both answers.]
Note that this overlap join is designed with problems where d2 doesn't necessarily have identical start and end coordinates (ex: genomics, the field where I come from, where d2 is usually about 30-150 million or more rows).
foverlaps() is stable, but is still under development, meaning some arguments and names might get changed.
NB: Since I mentioned GenomicRanges above, it is also perfectly capable of solving this problem. It uses interval trees under the hood, and is quite memory efficient as well. In my benchmarks on genomics data, foverlaps() is faster. But that's for another (blog) post, some other time.

data.table v1.9.8+ has a new feature - non-equi joins. With that, this operation becomes even more straightforward:
require(data.table) #v1.9.8+
# no need to set keys on `d1` or `d2`
d2[d1, .(x, pos=x.pos, start, end), on=.(x, pos>=start, pos<=end), nomatch=0L]
# x pos start end
# 1: a 2 1 3
# 2: a 3 1 3
# 3: c 20 19 22
# 4: e 10 7 25

1) sqldf This is not data.table but complex join criteria are easy to specify in a straight forward manner in SQL:
library(sqldf)
sqldf("select * from d1 join d2 using (x) where pos between start and end")
giving:
x start end pos
1 a 1 3 2
2 a 1 3 3
3 c 19 22 20
4 e 7 25 10
2) data.table For a data.table answer try this:
library(data.table)
setkey(d1, x)
setkey(d2, x)
d1[d2][between(pos, start, end)]
giving:
x start end pos
1: a 1 3 2
2: a 1 3 3
3: c 19 22 20
4: e 7 25 10
Note that this does have the disadvantage of forming the possibly large intermeidate result d1[d2] which SQL may not do. The remaining solutions may have this problem too.
3) dplyr This suggests the corresponding dplyr solution. We also use between from data.table:
library(dplyr)
library(data.table) # between
d1 %>%
inner_join(d2) %>%
filter(between(pos, start, end))
giving:
Joining by: "x"
x start end pos
1 a 1 3 2
2 a 1 3 3
3 c 19 22 20
4 e 7 25 10
4) merge/subset Using only the base of R:
subset(merge(d1, d2), start <= pos & pos <= end)
giving:
x start end pos
1: a 1 3 2
2: a 1 3 3
3: c 19 22 20
4: e 7 25 10
Added Note that the data table solution here is much faster than the one in the other answer:
dt1 <- function() {
d1 <- data.table(x=letters[1:5], start=c(1,5,19,30, 7), end=c(3,11,22,39,25))
d2 <- data.table(x=letters[c(1,1,2,2,3:5)], pos=c(2,3,3,12,20,52,10))
setkey(d1, x, start)
idx1 = d1[d2, which=TRUE, roll=Inf] # last observation carried forwards
setkey(d1, x, end)
idx2 = d1[d2, which=TRUE, roll=-Inf] # next observation carried backwards
idx = which(!is.na(idx1) & !is.na(idx2))
ans1 <<- cbind(d1[idx1[idx]], d2[idx, list(pos)])
}
dt2 <- function() {
d1 <- data.table(x=letters[1:5], start=c(1,5,19,30, 7), end=c(3,11,22,39,25))
d2 <- data.table(x=letters[c(1,1,2,2,3:5)], pos=c(2,3,3,12,20,52,10))
setkey(d1, x)
ans2 <<- d1[d2][between(pos, start, end)]
}
all.equal(as.data.frame(ans1), as.data.frame(ans2))
## TRUE
benchmark(dt1(), dt2())[1:4]
## test replications elapsed relative
## 1 dt1() 100 1.45 1.667
## 2 dt2() 100 0.87 1.000 <-- from (2) above

Overlap joins are available in dplyr 1.1.0 via the function join_by.
With join_by, you can do overlap join with between, or manually with >= and <=:
library(dplyr)
inner_join(d2, d1, by = join_by(x, between(pos, start, end)))
# x pos start end
#1 a 2 1 3
#2 a 3 1 3
#3 c 20 19 22
#4 e 10 7 25
inner_join(d2, d1, by = join_by(x, pos >= start, pos <= end))
# x pos start end
#1 a 2 1 3
#2 a 3 1 3
#3 c 20 19 22
#4 e 10 7 25

Using fuzzyjoin :
result <- fuzzyjoin::fuzzy_inner_join(d1, d2,
by = c('x', 'pos' = 'start', 'pos' = 'end'),
match_fun = list(`==`, `>=`, `<=`))
result
# x.x pos x.y start end
# <chr> <dbl> <chr> <dbl> <dbl>
#1 a 2 a 1 3
#2 a 3 a 1 3
#3 c 20 c 19 22
#4 e 10 e 7 25
Since fuzzyjoin returns all the columns we might need to do some cleaning to keep the columns that we want.
library(dplyr)
result %>% select(x = x.x, pos, start, end)
# A tibble: 4 x 4
# x pos start end
# <chr> <dbl> <dbl> <dbl>
#1 a 2 1 3
#2 a 3 1 3
#3 c 20 19 22
#4 e 10 7 25

Related

How to get same grouping result using data.table comparing to the sqldf?

I try to implement SQL query using sqldf and data.table.
I need to do this separately using these 2 different libraries.
Unfortunately, I cannot produce the same result using data.table.
library(sqldf)
library(data.table)
Id <- c(1,2,3,4)
HasPet <- c(0,0,1,1)
Age <- c(20,1,14,10)
Posts <- data.table(Id, HasPet, Age)
# sqldf way
ref <- sqldf("
SELECT Id, HasPet, MAX(Age) AS MaxAge
FROM Posts
GROUP BY HasPet
")
# data.table way
res <- Posts[,
list(Id, HasPet, MaxAge=max(Age)),
by=list(HasPet)]
head(ref)
head(res)
Output for sqldf is:
> head(ref)
Id HasPet MaxAge
1 1 0 20
2 3 1 14
while the output for data.table is different:
> head(res)
HasPet Id HasPet MaxAge
1: 0 1 0 20
2: 0 2 0 20
3: 1 3 1 14
4: 1 4 1 14
Please note, that SQL query cannot be modified.

This comes up a lot with data.table. If you want the max or min by group, the best way is a self-join. It's fast, and only a little arcane.
You can build it up step by step:
In data.table, you can select in i, do in j, and group afterwards. So first step is to find the thing we want within each level of the group
Posts[, Age == max(Age), by = HasPet]
# HasPet V1
# 1: 0 TRUE
# 2: 0 FALSE
# 3: 1 TRUE
# 4: 1 FALSE
We can use .I to retrieve the integer vector per row, then what was previously the V1 logical vector TRUE and FALSE indexes within each group so we have only the row containing the max per group.
Posts[, .I[Age == max(Age)], by=HasPet]
# From the data.table special symbols help:
# .I is an integer vector equal to seq_len(nrow(x)). While grouping,
# it holds for each item in the group, its row location in x. This is useful
# to subset in j; e.g. DT[, .I[which.max(somecol)], by=grp].
# HasPet V1
# 1: 0 1
# 2: 1 3
We then use the column V1 that we just made in order to call the specific rows (1 and 3) from the data.table. That's it!
Posts[Posts[, .I[Age == max(Age)], by=HasPet]$V1]

You can use .SD to get subset of rows for each value of HasPet.
library(data.table)
Posts[, .SD[Age==max(Age)], HasPet]
# HasPet Id Age
#1: 0 1 20
#2: 1 3 14

Loop through irregular list of numbers to append rows to summary table

I'm trying to write code that will loop through a list of integers, which relate to a number of sensors, to provide summary statistics (at this stage just cor()).
# GOOD TO HERE
corr_table <-data.frame(ID = integer()
, HxT = double())
for(j in gt_thrsh_key){ #this is currently set to 2:5 for testing - its a list of sensors I want to summarise
# extract humidity and time vectors
x <- sqldf(sprintf("SELECT humidity FROM data_agg_2 WHERE ID = %s",j))
y <- sqldf(sprintf("SELECT time_elapsed FROM data_agg_2 WHERE ID = %s",j))
# format into row
new_row <- data.frame(ID = c(j), HxT = c(cor(x,y))) #insert new variables into row
# append to dataframe
corr_table <- rbind(corr_table, new_row)
print(sprintf("Sensor %s has been summarised.",j)) # check 1
print(cor(x,y)) # check 2
}
print(corr_table)
assign("data_agg_2", data_agg_2, envir = .GlobalEnv)
I get output:
[1] "Sensor 2 has been summarised." "Sensor 3 has been summarised." "Sensor 4 has been summarised." "Sensor 5 has been summarised."
humidity -0.08950285
ID HxT
1 2 -0.08950285 #INCORRECT
2 3 -0.08950285 #INCORRECT
3 4 -0.08950285 #INCORRECT
4 5 -0.08950285 #correct
This is only the correct measurement for the final iteration of loop (id = 5), so somehow I must be overwriting previous entries. Does anyone know why this is happening? Or can you recommend a better way to perform this loop?
Thanks!!
EDIT: check 2 which prints the cor() of x and y through the loop confirms that only the final run of loop is calculating a value. Has anyone seen this before?

Here is a base R solution that uses lapply() to generate the correlations and write them to a list(). The list is converted to a data frame with do.call(rbind,...).
# simulate some data
set.seed(19041798) # ensure consistency across multiple runs
ID <- rep(1:10,20)
humidity <- rnorm(200,mean = 30,sd = 15)
elapsed_time <- rpois(200,2.5)
data <- data.frame(ID,humidity, elapsed_time)
uniqueIDs <- unique(data$ID)
correlationList <- lapply(uniqueIDs,function(x){
y <- subset(data,ID == x)
HxT <- cor(y$humidity,y$elapsed_time)
# return as data frame
data.frame(ID = x,HxT = HxT)
})
correlations <- do.call(rbind,correlationList)
...and the output:
> correlations
ID HxT
1 1 -0.1805885
2 2 -0.3166290
3 3 0.1749233
4 4 -0.2517737
5 5 0.1428092
6 6 0.3112812
7 7 -0.3180825
8 8 0.3774637
9 9 -0.3790178
10 10 -0.3070866
>
sqldf() version
We can restructure the code from the original post so it extracts all the data it needs through a single SQL query, and performs all subsequent processing in R.
First, we simulate 60,000 rows of data.
set.seed(19041798) # ensure consistency across multiple runs
ID <- rep(1:30,2000)
humidity <- rnorm(60000,mean = 30,sd = 15)
elapsed_time <- rpois(60000,2.5)
data <- data.frame(ID,humidity, elapsed_time)
Next, we extract data for the first 5 sensors from the data with sqldf(), as well as the vector of uniqueIDs.
library(sqldf)
# select ID <= 5
sqlStmt <- "select ID, humidity,elapsed_time from data where ID <= 5"
dataSubset <- sqldf(sqlStmt)
sqlStmt <- "select distinct ID from data where ID <= 5"
uniqueIDs <- sqldf(sqlStmt)[[1]]
At this point, the dataSubset data frame has 10,000 observations. We use lapply() with the vector of uniqueIDs to generate correlations by ID, count the complete.cases() included in each correlation, and write the results to a list of data frames.
correlationList <- lapply(uniqueIDs,function(x){
y <- subset(dataSubset,ID == x)
count <- sum(complete.cases(y)) # number of obs included in cor()
HxT <- cor(y$humidity,y$elapsed_time)
# return as data frame
data.frame(ID = x,count = count,HxT = HxT)
})
Finally, a do.call(rbind,...) and a print, and we have our list of correlations including counts of rows used to calculate the correlation.
correlations <- do.call(rbind,correlationList)
correlations
...and the output:
> correlations
ID count HxT
1 1 2000 0.015640244
2 2 2000 0.017143573
3 3 2000 -0.011283180
4 4 2000 0.052482666
5 5 2000 0.002083603
>

find closest match within a vector to fill missing values using dplyr

A dummy dataset is :
data <- data.frame(
group = c(1,1,1,1,1,2),
dates = as.Date(c("2005-01-01", "2006-05-01", "2007-05-01","2004-08-01",
"2005-03-01","2010-02-01")),
value = c(10,20,NA,40,NA,5)
)
For each group, the missing values need to be filled with the non-missing value corresponding to the nearest date within same group. In case of a tie, pick any.
I am using dplyr. which.closest from birk but it needs a vector and a value. How to look up within a vector without writing loops. Even if there is an SQL solution, will do.
Any pointers to the solution?
May be something like: value = value[match(which.closest(dates,THISdate) & !is.na(value))]
Not sure how to specify Thisdate.
Edit: The expected value vector should look like:
value = c(10,20,20,40,10,5)

Using knn1 (nearest neighbor) from the class package (which comes with R -- don't need to install it) and dplyr define an na.knn1 function which replaces each NA value in x with the non-NA x value having the closest time.
library(class)
na.knn1 <- function(x, time) {
is_na <- is.na(x)
if (sum(is_na) == 0 || all(is_na)) return(x)
train <- matrix(time[!is_na])
test <- matrix(time[is_na])
cl <- x[!is_na]
x[is_na] <- as.numeric(as.character(knn1(train, test, cl)))
x
}
data %>% mutate(value = na.knn1(value, dates))
giving:
group dates value
1 1 2005-01-01 10
2 1 2006-05-01 20
3 1 2007-05-01 20
4 1 2004-08-01 40
5 1 2005-03-01 10
6 2 2010-02-01 5
Add an appropriate group_by if the intention was to do this by group.

You can try the use of sapply to find the values closest since the x argument in `which.closest only takes a single value.
first create a vect whereby the dates with no values are replaced with NA and use it within the which.closest function.
library(birk)
vect=replace(data$dates,which(is.na(data$value)),NA)
transform(data,value=value[sapply(dates,which.closest,vec=vect)])
group dates value
1 1 2005-01-01 10
2 1 2006-05-01 20
3 1 2007-05-01 20
4 1 2004-08-01 40
5 1 2005-03-01 10
6 2 2010-02-01 5
if which.closest was to take a vector then there would be no need of sapply. But this is not the case.
Using the dplyr package:
library(birk)
library(dplyr)
data%>%mutate(vect=`is.na<-`(dates,is.na(value)),
value=value[sapply(dates,which.closest,vect)])%>%
select(-vect)

Replicate a SQL Join + Group By in data.table

I am trying to replicate a SQL sparse matrix multiplication using data tables. The SQL expression would be:
SELECT a.i, b.j, SUM(a.value*b.value)
FROM a, b
WHERE a.j = b.i
GROUP BY a.i, b.j;
where my data is structured as |i|j|value in each table
To create this data in R you can use:
library(reshape2)
library(data.table)
A <- matrix(runif(25),5,5)
B <- matrix(runif(25),5,5)
ADT <- data.table(melt(A))
BDT <- data.table(melt(B))
setnames(ADT,old = c("Var1","Var2","value"), new = c("Ai","Aj","AVal"))
setnames(BDT,old = c("Var1","Var2","value"), new = c("Bi","Bj","BVal"))
To merge using .[ we need to set the keys that we join on:
setkey(ADT,"Aj")
setkey(BDT,"Bi")
To build up piece by piece
ADT[BDT, allow.cartesian = T]
Ai Aj AVal Bj BVal
1: 1 1 0.39230905 1 0.7083956
2: 2 1 0.89523490 1 0.7083956
3: 3 1 0.92464689 1 0.7083956
4: 4 1 0.15127499 1 0.7083956
5: 5 1 0.88838458 1 0.7083956
---
121: 1 5 0.70144360 5 0.7924433
122: 2 5 0.50409075 5 0.7924433
123: 3 5 0.15693879 5 0.7924433
124: 4 5 0.09164371 5 0.7924433
125: 5 5 0.63787487 5 0.7924433
So far so good. The merge worked properly, Bi has disappeared, but this is encoded by Aj anyway. We now want to multiply AVal by BVal, and then sum the created groups (! in the same expression, I know that I could store and apply a second expression here). I had thought this would be:
ADT[BDT, j = list(Ai, Bj, Value = sum(AVal*BVal)), by = c("Ai","Bj") , allow.cartesian = T]
but I get the error: Object Bj not found. In fact, none of the values from 'BDT' are usable once I insert the by = clause (try to systematically remove Bj,BVal and "Bj" from the expression above, left to right, and you will see what I mean).
Looking into the .EACHI expression, it seems like the motive is here to do what I want, but .EACHI groups on the merged index, not on a separate variable.

Sounds like you simply want to aggregate after the merge:
ADT[BDT, allow.cartesian = T][, sum(AVal * BVal), by = .(Ai, Bj)]

Most efficient way to shift MultiIndex time series

I have a DataFrame that consists of many stacked time series. The index is (poolId, month) where both are integers, the "month" being the number of months since 2000. What's the best way to calculate one-month lagged versions of multiple variables?
Right now, I do something like:
cols_to_shift = ["bal", ...5 more columns...]
df_shift = df[cols_to_shift].groupby(level=0).transform(lambda x: x.shift(-1))
For my data, this took me a full 60 s to run. (I have 48k different pools and a total of 718k rows.)
I'm converting this from R code and the equivalent data.table call:
dt.shift <- dt[, list(bal=myshift(bal), ...), by=list(poolId)]
only takes 9 s to run. (Here "myshift" is something like "function(x) c(x[-1], NA)".)
Is there a way I can get the pandas verison to be back in line speed-wise? I tested this on 0.8.1.
Edit: Here's an example of generating a close-enough data set, so you can get some idea of what I mean:
ids = np.arange(48000)
lens = np.maximum(np.round(15+9.5*np.random.randn(48000)), 1.0).astype(int)
id_vec = np.repeat(ids, lens)
lens_shift = np.concatenate(([0], lens[:-1]))
mon_vec = np.arange(lens.sum()) - np.repeat(np.cumsum(lens_shift), lens)
n = len(mon_vec)
df = pd.DataFrame.from_items([('pool', id_vec), ('month', mon_vec)] + [(c, np.random.rand(n)) for c in 'abcde'])
df = df.set_index(['pool', 'month'])
%time df_shift = df.groupby(level=0).transform(lambda x: x.shift(-1))
That took 64 s when I tried it. This data has every series starting at month 0; really, they should all end at month np.max(lens), with ragged start dates, but good enough.
Edit 2: Here's some comparison R code. This takes 0.8 s. Factor of 80, not good.
library(data.table)
ids <- 1:48000
lens <- as.integer(pmax(1, round(rnorm(ids, mean=15, sd=9.5))))
id.vec <- rep(ids, times=lens)
lens.shift <- c(0, lens[-length(lens)])
mon.vec <- (1:sum(lens)) - rep(cumsum(lens.shift), times=lens)
n <- length(id.vec)
dt <- data.table(pool=id.vec, month=mon.vec, a=rnorm(n), b=rnorm(n), c=rnorm(n), d=rnorm(n), e=rnorm(n))
setkey(dt, pool, month)
myshift <- function(x) c(x[-1], NA)
system.time(dt.shift <- dt[, list(month=month, a=myshift(a), b=myshift(b), c=myshift(c), d=myshift(d), e=myshift(e)), by=pool])

I would suggest you reshape the data and do a single shift versus the groupby approach:
result = df.unstack(0).shift(1).stack()
This switches the order of the levels so you'd want to swap and reorder:
result = result.swaplevel(0, 1).sortlevel(0)
You can verify it's been lagged by one period (you want shift(1) instead of shift(-1)):
In [17]: result.ix[1]
Out[17]:
a b c d e
month
1 0.752511 0.600825 0.328796 0.852869 0.306379
2 0.251120 0.871167 0.977606 0.509303 0.809407
3 0.198327 0.587066 0.778885 0.565666 0.172045
4 0.298184 0.853896 0.164485 0.169562 0.923817
5 0.703668 0.852304 0.030534 0.415467 0.663602
6 0.851866 0.629567 0.918303 0.205008 0.970033
7 0.758121 0.066677 0.433014 0.005454 0.338596
8 0.561382 0.968078 0.586736 0.817569 0.842106
9 0.246986 0.829720 0.522371 0.854840 0.887886
10 0.709550 0.591733 0.919168 0.568988 0.849380
11 0.997787 0.084709 0.664845 0.808106 0.872628
12 0.008661 0.449826 0.841896 0.307360 0.092581
13 0.727409 0.791167 0.518371 0.691875 0.095718
14 0.928342 0.247725 0.754204 0.468484 0.663773
15 0.934902 0.692837 0.367644 0.061359 0.381885
16 0.828492 0.026166 0.050765 0.524551 0.296122
17 0.589907 0.775721 0.061765 0.033213 0.793401
18 0.532189 0.678184 0.747391 0.199283 0.349949
In [18]: df.ix[1]
Out[18]:
a b c d e
month
0 0.752511 0.600825 0.328796 0.852869 0.306379
1 0.251120 0.871167 0.977606 0.509303 0.809407
2 0.198327 0.587066 0.778885 0.565666 0.172045
3 0.298184 0.853896 0.164485 0.169562 0.923817
4 0.703668 0.852304 0.030534 0.415467 0.663602
5 0.851866 0.629567 0.918303 0.205008 0.970033
6 0.758121 0.066677 0.433014 0.005454 0.338596
7 0.561382 0.968078 0.586736 0.817569 0.842106
8 0.246986 0.829720 0.522371 0.854840 0.887886
9 0.709550 0.591733 0.919168 0.568988 0.849380
10 0.997787 0.084709 0.664845 0.808106 0.872628
11 0.008661 0.449826 0.841896 0.307360 0.092581
12 0.727409 0.791167 0.518371 0.691875 0.095718
13 0.928342 0.247725 0.754204 0.468484 0.663773
14 0.934902 0.692837 0.367644 0.061359 0.381885
15 0.828492 0.026166 0.050765 0.524551 0.296122
16 0.589907 0.775721 0.061765 0.033213 0.793401
17 0.532189 0.678184 0.747391 0.199283 0.349949
Perf isn't too bad with this method (it might be a touch slower in 0.9.0):
In [19]: %time result = df.unstack(0).shift(1).stack()
CPU times: user 1.46 s, sys: 0.24 s, total: 1.70 s
Wall time: 1.71 s

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

How can I merge two data frames on a range of dates? [duplicate] - sql

Related

How to get same grouping result using data.table comparing to the sqldf?

Loop through irregular list of numbers to append rows to summary table

find closest match within a vector to fill missing values using dplyr

Replicate a SQL Join + Group By in data.table

Most efficient way to shift MultiIndex time series

Categories

Resources