This is the R script that I am attempting to recreate using a CASE WHEN statement in SQL:
dat[ ,X_1_7_Spline := pmax(1,pmin(ifelse(is.na(X),1,X),7))]
As far as I can tell, this command replaces missing values of X with 1, caps the result at 7 via the parallel minimum, and floors it at 1 via the parallel maximum, then stores these values as a new column in the original dataset (dat). I am having trouble representing the "pmax(1,pmin(ifelse(is.na(X),1,X),7))" portion of the code in my SQL query and would appreciate any ideas on how I might be able to do this effectively.
I have something very remedial right now, which I know does not express this above statement properly:
CASE WHEN MAX(IF(ISNOTNULL(X) AND MIN(X)=1 AND MAX(X)=7) then 1 else X end as X_1_7_Spline
Any thoughts/feedback would be greatly appreciated as I am still trying to understand the R script. Thanks in advance for any insight on this issue.
ifelse(is.na(X),1,X) can be translated into SQL's COALESCE(X, 1); and
pmin and pmax logic can be placed in a CASE WHEN (as you've started)
Perhaps this?
CASE WHEN X < 1 THEN 1
WHEN X > 7 THEN 7
ELSE coalesce(X, 1) END as NewX
We don't need to worry about coalescing inside the X < 1 or X > 7 tests, because NULL < 1 does not evaluate to true, so NULL rows fall through to the ELSE branch.
Demo in R using sqldf:
library(data.table)
dat <- data.table(X = c(-1,5,9,NA))
dat[, X_1_7_Spline := pmax(1,pmin(ifelse(is.na(X),1,X),7)) ]
sqldf::sqldf("select *, (CASE WHEN X < 1 THEN 1 WHEN X > 7 THEN 7 ELSE coalesce(X,1) END) as NewX from dat")
# X X_1_7_Spline NewX
# 1 -1 1 1
# 2 5 5 5
# 3 9 7 7
# 4 NA 1 1
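For reference, the pmax/pmin expression is simply a clamp with a default for missing values. A minimal Python sketch of the same logic (the function name is mine, for illustration only):

```python
def spline_1_7(x):
    # ifelse(is.na(X), 1, X): replace a missing value with 1 ...
    v = 1 if x is None else x
    # ... then pmin(., 7) caps at 7 and pmax(1, .) floors at 1
    return max(1, min(v, 7))

print([spline_1_7(x) for x in [-1, 5, 9, None]])  # [1, 5, 7, 1]
```

This mirrors the SQL above: COALESCE supplies the default, and the two CASE branches implement the clamp.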
Consider the following data.tables. The first defines a set of regions with start and end positions for each group 'x':
library(data.table)
d1 <- data.table(x = letters[1:5], start = c(1,5,19,30, 7), end = c(3,11,22,39,25))
setkey(d1, x, start)
# x start end
# 1: a 1 3
# 2: b 5 11
# 3: c 19 22
# 4: d 30 39
# 5: e 7 25
The second data set has the same grouping variable 'x', and positions 'pos' within each group:
d2 <- data.table(x = letters[c(1,1,2,2,3:5)], pos = c(2,3,3,12,20,52,10))
setkey(d2, x, pos)
# x pos
# 1: a 2
# 2: a 3
# 3: b 3
# 4: b 12
# 5: c 20
# 6: d 52
# 7: e 10
Ultimately I'd like to extract the rows in 'd2' where 'pos' falls within the range defined by 'start' and 'end', within each group x. The desired result is
# x pos start end
# 1: a 2 1 3
# 2: a 3 1 3
# 3: c 20 19 22
# 4: e 10 7 25
The start/end positions for any group x will never overlap but there may be gaps of values not in any region.
Now, I believe I should be using a rolling join. From what I can tell, I cannot use the "end" column in the join.
I've tried
d1[d2, roll = TRUE, nomatch = 0, mult = "all"][start <= end]
and got
# x start end
# 1: a 2 3
# 2: a 3 3
# 3: c 20 22
# 4: e 10 25
which is the right set of rows I want; however, "pos" has become "start" and the original "start" has been lost. Is there a way to preserve all the columns with the rolling join so I could report "start", "pos", "end" as desired?
Overlap joins were implemented with commit 1375 in data.table v1.9.3, and are available in the current stable release, v1.9.4. The function is called foverlaps. From NEWS:
29) Overlap joins #528 is now here, finally!! Except for type="equal" and maxgap and minoverlap arguments, everything else is implemented. Check out ?foverlaps and the examples there on its usage. This is a major feature addition to data.table.
Let's consider x, an interval defined as [a, b], where a <= b, and y, another interval defined as [c, d], where c <= d. The interval y is said to overlap x at all iff d >= a and c <= b. And y is entirely contained within x iff a <= c and d <= b. For the different types of overlaps implemented, please have a look at ?foverlaps.
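The two conditions are easy to state directly; here is a small Python rendering of the definitions above (the function names are mine, for illustration only):

```python
def overlaps(a, b, c, d):
    # y = [c, d] overlaps x = [a, b] at all iff d >= a and c <= b
    return d >= a and c <= b

def within(a, b, c, d):
    # y = [c, d] lies entirely within x = [a, b] iff a <= c and d <= b
    return a <= c and d <= b

print(overlaps(1, 3, 2, 2), within(1, 3, 2, 2))      # True True
print(overlaps(5, 11, 3, 12), within(5, 11, 3, 12))  # True False
```

Note that containment implies overlap, but not the other way around, as the second example shows.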
Your question is a special case of an overlap join: in d1 you have true physical intervals with start and end positions. In d2, on the other hand, there are only positions (pos), not intervals. To be able to do an overlap join, we need to create intervals in d2 as well. This is achieved by creating an additional variable pos2, which is identical to pos (d2[, pos2 := pos]). Thus, we now have an interval in d2, albeit with identical start and end coordinates. This 'virtual, zero-width interval' in d2 can then be used in foverlaps to do an overlap join with d1:
require(data.table) ## 1.9.3
setkey(d1)
d2[, pos2 := pos]
foverlaps(d2, d1, by.x = names(d2), type = "within", mult = "all", nomatch = 0L)
# x start end pos pos2
# 1: a 1 3 2 2
# 2: a 1 3 3 3
# 3: c 19 22 20 20
# 4: e 7 25 10 10
by.y by default is key(y), so we skipped it. by.x by default takes key(x) if it exists, and if not takes key(y). But a key doesn't exist for d2, and we can't set the columns from y, because they don't have the same names. So, we set by.x explicitly.
The type of overlap is within, and we'd like all matches, but only when there is a match (hence nomatch = 0L).
NB: foverlaps uses data.table's binary search feature (along with roll where necessary) under the hood, but some function arguments (types of overlaps, maxgap, minoverlap etc..) are inspired by the function findOverlaps() from the Bioconductor package IRanges, an excellent package (and so is GenomicRanges, which extends IRanges for Genomics).
So what's the advantage?
A benchmark of the code above on your data shows foverlaps() to be slower than Gabor's answer (Timings: Gabor's data.table solution = 0.004 vs foverlaps = 0.021 seconds). But does it really matter at this scale?
What would be really interesting is to see how well it scales - in terms of both speed and memory. In Gabor's answer, we join based on the key column x. And then filter the results.
What if d1 has 20K rows and d2 has 100K rows (or more)? For each row in d2 that matches an x in d1, all those rows are matched and returned, only to be filtered later. Here's an example of your Q scaled only slightly:
Generate data:
require(data.table)
set.seed(1L)
n = 20e3L; k = 100e3L
idx1 = sample(100, n, TRUE)
idx2 = sample(100, n, TRUE)
d1 = data.table(x = sample(letters[1:5], n, TRUE),
start = pmin(idx1, idx2),
end = pmax(idx1, idx2))
d2 = data.table(x = sample(letters[1:15], k, TRUE),
pos1 = sample(60:150, k, TRUE))
foverlaps:
system.time({
setkey(d1)
d2[, pos2 := pos1]
ans1 = foverlaps(d2, d1, by.x=1:3, type="within", nomatch=0L)
})
# user system elapsed
# 3.028 0.635 3.745
This took ~1GB of memory in total, of which ans1 is 420MB. Most of the time here is really spent on the subset. You can check this by setting the argument verbose=TRUE.
Gabor's solutions:
## new session - data.table solution
system.time({
setkey(d1, x)
ans2 <- d1[d2, allow.cartesian=TRUE, nomatch=0L][between(pos1, start, end)]
})
# user system elapsed
# 15.714 4.424 20.324
And this took a total of ~3.5GB.
I just noted that Gabor already mentions the memory required for intermediate results. So, trying out sqldf:
# new session - sqldf solution
system.time(ans3 <- sqldf("select * from d1 join
d2 using (x) where pos1 between start and end"))
# user system elapsed
# 73.955 1.605 77.049
Took a total of ~1.4GB. So, it definitely uses less memory than the one shown above.
[The answers were verified to be identical after removing pos2 from ans1 and setting key on both answers.]
Note that this overlap join is designed with problems where d2 doesn't necessarily have identical start and end coordinates (ex: genomics, the field where I come from, where d2 is usually about 30-150 million or more rows).
foverlaps() is stable, but is still under development, meaning some arguments and names might get changed.
NB: Since I mentioned GenomicRanges above, it is also perfectly capable of solving this problem. It uses interval trees under the hood, and is quite memory efficient as well. In my benchmarks on genomics data, foverlaps() is faster. But that's for another (blog) post, some other time.
data.table v1.9.8+ has a new feature - non-equi joins. With that, this operation becomes even more straightforward:
require(data.table) #v1.9.8+
# no need to set keys on `d1` or `d2`
d2[d1, .(x, pos=x.pos, start, end), on=.(x, pos>=start, pos<=end), nomatch=0L]
# x pos start end
# 1: a 2 1 3
# 2: a 3 1 3
# 3: c 20 19 22
# 4: e 10 7 25
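In plain terms, the non-equi join expresses a filtered nested loop where the range condition is applied during the join rather than afterwards. A minimal Python sketch of the same semantics, using the question's data (plain tuples instead of data.tables):

```python
d1 = [("a", 1, 3), ("b", 5, 11), ("c", 19, 22), ("d", 30, 39), ("e", 7, 25)]
d2 = [("a", 2), ("a", 3), ("b", 3), ("b", 12), ("c", 20), ("d", 52), ("e", 10)]

# Join on the group key with the range condition checked inside the join,
# which is what on=.(x, pos>=start, pos<=end) expresses declaratively.
result = [(x2, pos, start, end)
          for (x2, pos) in d2
          for (x1, start, end) in d1
          if x1 == x2 and start <= pos <= end]
print(result)
# [('a', 2, 1, 3), ('a', 3, 1, 3), ('c', 20, 19, 22), ('e', 10, 7, 25)]
```

Of course, data.table does this with binary search rather than a quadratic scan; the sketch only illustrates the join condition.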
1) sqldf This is not data.table, but complex join criteria are easy to specify in a straightforward manner in SQL:
library(sqldf)
sqldf("select * from d1 join d2 using (x) where pos between start and end")
giving:
x start end pos
1 a 1 3 2
2 a 1 3 3
3 c 19 22 20
4 e 7 25 10
2) data.table For a data.table answer try this:
library(data.table)
setkey(d1, x)
setkey(d2, x)
d1[d2][between(pos, start, end)]
giving:
x start end pos
1: a 1 3 2
2: a 1 3 3
3: c 19 22 20
4: e 7 25 10
Note that this does have the disadvantage of forming the possibly large intermediate result d1[d2], which SQL may not do. The remaining solutions may have this problem too.
3) dplyr This suggests the corresponding dplyr solution. We also use between from data.table:
library(dplyr)
library(data.table) # between
d1 %>%
inner_join(d2) %>%
filter(between(pos, start, end))
giving:
Joining by: "x"
x start end pos
1 a 1 3 2
2 a 1 3 3
3 c 19 22 20
4 e 7 25 10
4) merge/subset Using only the base of R:
subset(merge(d1, d2), start <= pos & pos <= end)
giving:
x start end pos
1: a 1 3 2
2: a 1 3 3
3: c 19 22 20
4: e 7 25 10
Added: Note that the data.table solution here is much faster than the one in the other answer:
dt1 <- function() {
d1 <- data.table(x=letters[1:5], start=c(1,5,19,30, 7), end=c(3,11,22,39,25))
d2 <- data.table(x=letters[c(1,1,2,2,3:5)], pos=c(2,3,3,12,20,52,10))
setkey(d1, x, start)
idx1 = d1[d2, which=TRUE, roll=Inf] # last observation carried forwards
setkey(d1, x, end)
idx2 = d1[d2, which=TRUE, roll=-Inf] # next observation carried backwards
idx = which(!is.na(idx1) & !is.na(idx2))
ans1 <<- cbind(d1[idx1[idx]], d2[idx, list(pos)])
}
dt2 <- function() {
d1 <- data.table(x=letters[1:5], start=c(1,5,19,30, 7), end=c(3,11,22,39,25))
d2 <- data.table(x=letters[c(1,1,2,2,3:5)], pos=c(2,3,3,12,20,52,10))
setkey(d1, x)
ans2 <<- d1[d2][between(pos, start, end)]
}
all.equal(as.data.frame(ans1), as.data.frame(ans2))
## TRUE
library(rbenchmark)
benchmark(dt1(), dt2())[1:4]
## test replications elapsed relative
## 1 dt1() 100 1.45 1.667
## 2 dt2() 100 0.87 1.000 <-- from (2) above
Overlap joins are available in dplyr 1.1.0 via the function join_by.
With join_by, you can do an overlap join with between(), or manually with >= and <=:
library(dplyr)
inner_join(d2, d1, by = join_by(x, between(pos, start, end)))
# x pos start end
#1 a 2 1 3
#2 a 3 1 3
#3 c 20 19 22
#4 e 10 7 25
inner_join(d2, d1, by = join_by(x, pos >= start, pos <= end))
# x pos start end
#1 a 2 1 3
#2 a 3 1 3
#3 c 20 19 22
#4 e 10 7 25
Using fuzzyjoin:
result <- fuzzyjoin::fuzzy_inner_join(d1, d2,
by = c('x', 'pos' = 'start', 'pos' = 'end'),
match_fun = list(`==`, `>=`, `<=`))
result
# x.x pos x.y start end
# <chr> <dbl> <chr> <dbl> <dbl>
#1 a 2 a 1 3
#2 a 3 a 1 3
#3 c 20 c 19 22
#4 e 10 e 7 25
Since fuzzyjoin returns all the columns we might need to do some cleaning to keep the columns that we want.
library(dplyr)
result %>% select(x = x.x, pos, start, end)
# A tibble: 4 x 4
# x pos start end
# <chr> <dbl> <dbl> <dbl>
#1 a 2 1 3
#2 a 3 1 3
#3 c 20 19 22
#4 e 10 7 25
I am currently writing a program to solve a brain teaser.
How it works:
Using the digits 1-9 only once, make the four corners and each diagonal sum to 26.
Hint: make the middle 7.
Anyway, my code basically starts at "111111111" and counts up, each time checking whether it matches the required parameters.
Code:
Public Class Main
Dim count As Integer
Dim split() As Char
Dim done As Boolean
Dim attempts As Integer
Private Sub IncreaseOne()
If count < 999999999 Then
count += 1
Else
done = True
End If
If CStr(count).Contains("0") Then
count = CStr(count).Replace("0", "1")
End If
End Sub
Private Sub Reset()
count = 111111111
attempts = 0
End Sub
Private Sub IntToLbl()
split = CStr(count).ToCharArray
lbl1.Text = split(0)
lbl2.Text = split(1)
lbl3.Text = split(2)
lbl4.Text = split(3)
lbl5.Text = split(4)
lbl6.Text = split(5)
lbl7.Text = split(6)
lbl8.Text = split(7)
lbl9.Text = split(8)
lblAttempts.Text = "Attempt: " & attempts
End Sub
Private Sub Check()
attempts += 1
If split(0) + split(1) + split(7) + split(8) = 26 And split(0) + split(2) + split(4) + split(6) + split(8) = 26 And split(1) + split(3) + split(4) + split(5) + split(7) = 26 Then
If CStr(count).Contains("1") And CStr(count).Contains("2") And CStr(count).Contains("3") And CStr(count).Contains("4") _
And CStr(count).Contains("5") And CStr(count).Contains("6") And CStr(count).Contains("7") And CStr(count).Contains("8") _
And CStr(count).Contains("9") Then
ListBox1.Items.Add("A" & attempts & ": " & CStr(count))
End If
End If
End Sub
Private Sub Act()
While done = False
IncreaseOne()
IntToLbl()
Check()
End While
tr.Abort()
End Sub
Dim suspended As Boolean = False
Dim tr As New System.Threading.Thread(AddressOf Act)
Private Sub Button1_Click(sender As Object, e As EventArgs) Handles btnSolve.Click
If suspended = True Then
tr.Resume()
suspended = False
Else
If tr.IsAlive = False Then
Reset()
tr.Start()
CheckForIllegalCrossThreadCalls = False
Else
Dim Reply = MsgBox("Thread is running! Stop the thread?", MsgBoxStyle.YesNo, "Error!")
If Reply = MsgBoxResult.Yes Then
tr.Suspend()
suspended = True
End If
End If
End If
End Sub
Private Sub Main_FormClosing(sender As Object, e As FormClosingEventArgs) Handles Me.FormClosing
tr.Abort()
End Sub
Private Sub tr2_Tick(sender As Object, e As EventArgs) Handles tr2.Tick
IncreaseOne()
IntToLbl()
Check()
End Sub
End Class
Before using a thread, you should 1) reduce your algorithm's complexity and 2) improve its efficiency.
1) For the complexity: since each figure can appear only once, you have 9! = 362,880 tests to do, which is three orders of magnitude fewer than a full scan.
I guess that already at that point you'll get the result in real time on most computers, but there are also combinations for which you can stop before testing every sub-combination (e.g. if the first diagonal does not sum to 26, there is no need to test permutations of the other items). With this you could cut the number of tests down even more.
Another way to reduce the case count is to use symmetry. Here, one- or two-step rotations and horizontal or vertical flips won't affect the result, which makes another 16× cut in the test count.
2) For the efficiency, using arrays of integers instead of strings will bring you a huge speed boost.
I did a jsfiddle (in JavaScript, then) that tests only the 9! elements and uses arrays; it gives the result instantly, so I did not look further into early stopping / symmetry.
One solution is, for instance : 3,2,7,5,9,6,1,4,8
which makes :
3 6
2 1
7
4 5
8 9
which seems to be ok.
fiddle is here : http://jsfiddle.net/HRdyf/2/
The figures are coded this way: the first 5 figures go to the first diagonal, the central item has index 2, and the 4 others go to the second diagonal, except for its central item.
(There might be more efficient ways to encode the array, allowing, as explained earlier, some tests to be stopped early.)
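The fiddle's approach translates directly to a permutation scan. A minimal Python analogue, using the index layout described above (the function name `solve` is mine, not from the original post):

```python
from itertools import permutations

# Indices 0-4 hold the first diagonal, index 2 is the shared center,
# and indices 5-8 fill the second diagonal around that center.
# Corners are therefore the diagonal endpoints: indices 0, 4, 5, 8.
def solve():
    found = []
    for p in permutations(range(1, 10)):
        diag1 = p[0] + p[1] + p[2] + p[3] + p[4]
        diag2 = p[5] + p[6] + p[2] + p[7] + p[8]
        corners = p[0] + p[4] + p[5] + p[8]
        if diag1 == 26 and diag2 == 26 and corners == 26:
            found.append(p)
    return found

solutions = solve()
print(len(solutions))                             # 64
print((3, 2, 7, 5, 9, 6, 1, 4, 8) in solutions)   # True
```

The count of 64 agrees with the combinatorial derivation in the remark that follows, and the example solution above is among them.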
Remark: We can find all the solutions with maths.
Let's call c1, c2, c3, c4 the four corners, m the central point, and d11, d12, d21, d22 the four remaining points of the two diagonals.
then
1) c1 + c2 + c3 + c4 = 26
2) c1 + d11 + m + d12 + c3 = 26
3) c2 + d21 + m + d22 + c4 = 26
4) all points are different and in the 1..9 range.
5) (from 4) : sum of all points = 45 (sum from 1 to 9 )
6) from 5) and 1) --> d11 + d12 + m + d21 + d22 = 45 - 26 = 19
(inner points total = total - corner total)
7 ) now adding 2) and 3) and using 1) and 6) we have 19 + 26 + m = 26 + 26
So --->>> m=7
8) Considering 1), 4) and 7), we see that we cannot reach 26 with four integers different from 7 without using both 9 and 8 (the max we can reach without 7 and 9 is 8+6+5+4 = 23, and the max reached without 7 and 8 is 9+6+5+4 = 24).
So --> two corners have 9, 8 as values.
9) With 8), 1), 7), and 4) : the two other corners can only be 6,3 or 5,4
(if r1 and r2 are the not 9 or 8 corners, we have r1+ r2 = 9 )
At this point : center is 7, and corners are either [4,5,8,9] or [3,6,8,9] (and permutations)
For [4,5,8,9] - > remains [1,2,3,6] (sum = 12)
For [3,6,8,9] - > remains [1,2,4,5] (sum = 12)
We cannot have 9 and 8 on same diagonal, since 8 + d11 + 7 + d12 + 9 = 26 makes d11 + d12 = 2 which is not
possible considering 4)
Let's consider the corners = [4,5,8,9] case, and look at the other end of the diagonal starting with 9. It might be 4 or 5.
4 : 9 + d11 + 7 + d12 + 4 = 26 --> d11 + d12 = 6 --> no solution, since no pair from the remaining [1,2,3,6] sums to 6.
5 : 9 + d11 + 7 + d12 + 5 = 26 --> d11 + d12 = 5 --> (2,3) is the only solution for d11 and d12 --> remains (1,6) for d21 and d22 (and indeed 8 + 1 + 7 + 6 + 4 = 26).
Now the corners = [3,6,8,9] case; consider also the other end of the diagonal starting with 9. It might end with 6 or 3.
3 : d11 + d12 = 7 --> (5,2) is the only solution ((4,3) and (6,1) cannot work since 3 and 6 are in use) --> remains (1,4) for d21 and d22.
6 : d11 + d12 = 4 --> no solution ((1,3) is the only pair summing to 4, and 3 is in use).
---> so the diagonal starting with 9 can only end with 5 or 3.
deduction ---> the diagonal starting with 8 will end with 4 (when the other one ends with 5) or with 6 (when the other one ends with 3).
How many solutions ?
4 possibilities to choose where the 9 is, then 2 choices for the other end of the 9 diagonal (5 or 3), then 2 choices for the 8 diagonal (starting upstairs or downstairs), then 4 possibilities left for the d11, d12 ; d21, d22 choices: [2,3] + [1,6] if we choose 5 as 9's end, and [5,2] + [1,4] if we choose 3 as 9's end.
4 * 2 * 2 * 4 makes 64 solutions.
This is one of those problems that requires some analysis (pencil, paper and subtraction) before coding. Since at least one of the diagonals must contain 9, the possibilities for that sequence (diagonal) are few. The next number in that sequence can only be 8, 7, or 6, with each of those having only a few possibilities.
9 8 6 2 1
9 8 5 3 1
9 8 4 3 2
9 7 6 3 1 remaining 2 4 5 8 = 19
9 7 5 4 1 remaining 2 3 6 8 = 19
9 7 5 3 2 remaining 1 4 6 8 = 19 *
9 6 5 4 2
(I may have missed some???)
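That doubt can be settled mechanically. A short Python check (my addition, not part of the original answer) enumerates every descending five-digit sequence containing 9 that sums to 26:

```python
from itertools import combinations

# All 5-element subsets of 1..9 that contain 9 and sum to 26.
# Walking the range in descending order keeps each tuple descending,
# matching the layout of the hand-written list above.
seqs = [c for c in combinations(range(9, 0, -1), 5)
        if c[0] == 9 and sum(c) == 26]
for s in seqs:
    print(*s)
print(len(seqs))  # 7
```

So the seven sequences listed above are in fact all of them: none are missed.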
Once those few sequences are known, the second diagonal must also reach 26: the four remaining numbers plus the sequence's shared middle number must sum to 26.
edit: a little more pencil / paper work shows that of those only the sequences with 7's in the center work.
edit: John Wein on the MSDN site came up with this math:
the sum of all possible numbers (1-9) = 45
the two diagonals cover all nine boxes but count the center square twice, so diag1(26) + diag2(26) - sum = center square value: 52 - 45 = 7
sum - corner values - center value = values of the 4 radial boxes: 45 - 26 - 7 = 12
12 can only be some combo of 1,2,3,6 or 1,2,4,5
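That last step can also be verified mechanically; a small Python check (my addition, not John Wein's) of which four-digit combinations can fill the radial boxes:

```python
from itertools import combinations

# Four distinct digits from 1-9, excluding the center 7, summing to 12.
radials = [c for c in combinations([d for d in range(1, 10) if d != 7], 4)
           if sum(c) == 12]
print(radials)  # [(1, 2, 3, 6), (1, 2, 4, 5)]
```

Exactly the two combos claimed above, and no others.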
If you have a simple "try all possibilities" approach, then parallelizing the code could definitely make it faster. And it would be easy, since no mutable data needs to be shared between threads.