I can't code this term in GAMS:
if there exists k' < k such that for all s, b(s,k') = b(s,k), then b(s,k) = (b(s,k) + b(s,k-1))/2;
else do not change b(s,k)
loop(k1$(k1.val < k.val),
  loop(s,
    if(B(s,k1) <> B(s,k),
      break;
    else
      b(s,k) = (B(s,k) + B(s,k-1))/2;
    );
  );
);
I don't know where I have to put break. Is it necessary to use a loop over s? Is there a better way to code that?
Example: in the case below, b(s,'1') = b(s,'4') for all s, so we have to update b(s,'4') = (b(s,'3') + b(s,'4'))/2:
iter=k=4, k'={1,2,3}, s={1,2,3};
b(1,1)=3, b(2,1)=7, b(3,1)=9,
b(1,2)=2, b(2,2)=4, b(3,2)=11,
b(1,3)=5, b(2,3)=12, b(3,3)=8,
b(1,4)=3, b(2,4)=7, b(3,4)=9,
I used the sameAs command too, but when I used sameAs(b(s,k'),b(s,k)) (in the loop over k1) I got Error 121!
The following might work for you. Note that it relies on pairs of matches. But given that the update rule is independent of the equality condition, that shouldn't matter here.
Alias k. (I also display b for clarity.)
sets s /1*3/, k /1*4/;
alias(k,kk);
table b(s,k)
1 2 3 4
1 3 2 5 3
2 7 4 12 7
3 9 11 8 9
;
Define a set Z over s, k, kk. This can pick out matches across k. If k is a subset then you need to define Z across the uppermost parent and an alias of it.
set Z(s,k,kk);
Z(s,k,kk) = no;
Include a tuple in set Z when the values are equal across k for a given s and kk > k.
Z(s,k,kk) = yes$((b(s,k) eq b(s,kk)) and (kk.val gt k.val));
option Z:1:1:2;
display Z;
This defines where the matches exist, e.g. for every s (rows) there is a match of k=1 with kk=4 (column 1.4).
---- 26 SET Z
1.4
1 YES
2 YES
3 YES
Then we update b where Z is defined.
loop(k,
  b(s,kk)$Z(s,k,kk) = (b(s,kk) + b(s,kk-1))/2;
);
display b;
This gives me:
---- 31 PARAMETER b
1 2 3 4
1 3.000 2.000 5.000 4.000
2 7.000 4.000 12.000 9.500
3 9.000 11.000 8.000 8.500
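As a quick cross-check of these numbers, the same update can be sketched in Python/NumPy (an illustration only, not part of the GAMS answer):

import numpy as np

b = np.array([[3., 2., 5., 3.],
              [7., 4., 12., 7.],
              [9., 11., 8., 9.]])

# for every pair of columns k < kk that match on all rows s,
# replace column kk by the average of itself and its predecessor
for k in range(b.shape[1]):
    for kk in range(k + 1, b.shape[1]):
        if np.array_equal(b[:, k], b[:, kk]):
            b[:, kk] = (b[:, kk] + b[:, kk - 1]) / 2
print(b)   # the last column becomes [4, 9.5, 8.5], matching the GAMS output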
I have a dataframe of a few hundred rows that can be grouped by Id as follows:
df = Val1 Val2 Val3 Id
2 2 8 b
1 2 3 a
5 7 8 z
5 1 4 a
0 9 0 c
3 1 3 b
2 7 5 z
7 2 8 c
6 5 5 d
...
5 1 8 a
4 9 0 z
1 8 2 z
I want to use GridSearchCV, but with a custom CV that will ensure that all the rows from the same ID will always be in the same set.
So either all the rows of a are in the test set, or all of them are in the train set, and so on for all the different IDs.
I want to have 5 folds, so 80% of the IDs will go to the train set and 20% to the test set.
I understand that it can't guarantee that all folds will have the exact same number of rows, since one ID might have more rows than another.
What is the best way to do so?
As stated, you can provide cv with an iterable. You can use GroupShuffleSplit(). For example, once you use it to split your dataset, you can pass the result to GridSearchCV() as the cv parameter.
As mentioned in the sklearn documentation, there's a parameter called "cv" where you can provide "An iterable yielding (train, test) splits as arrays of indices."
Do check out the documentation first in future.
As mentioned previously, GroupShuffleSplit() splits data based on group labels. However, the test sets aren't necessarily disjoint (i.e. across multiple splits, an ID may appear in multiple test sets). If you want each ID to appear in exactly one test fold, you could use GroupKFold(). This is also available in sklearn.model_selection, and directly extends KFold to take group labels into account.
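For instance, a minimal GroupKFold sketch (the estimator, parameter grid, and synthetic data below are placeholders, not from the question):

import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(100, 3)                # stands in for Val1, Val2, Val3
y = rng.randint(0, 2, 100)          # some target
groups = rng.randint(0, 20, 100)    # the Id column, encoded as integers

# GroupKFold keeps every row of an ID in the same fold, and each ID
# appears in exactly one test fold across the 5 splits
cv = GroupKFold(n_splits=5)
search = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]},
                      cv=cv.split(X, y, groups))
search.fit(X, y)
print(search.best_params_)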
A dummy dataset is:
data <- data.frame(
group = c(1,1,1,1,1,2),
dates = as.Date(c("2005-01-01", "2006-05-01", "2007-05-01","2004-08-01",
"2005-03-01","2010-02-01")),
value = c(10,20,NA,40,NA,5)
)
For each group, the missing values need to be filled with the non-missing value corresponding to the nearest date within the same group. In case of a tie, pick any.
I am using dplyr. which.closest from the birk package looks relevant, but it needs a vector and a value. How do I look up within a vector without writing loops? Even an SQL solution will do.
Any pointers to the solution?
Maybe something like: value = value[match(which.closest(dates, THISdate) & !is.na(value))]
Not sure how to specify THISdate.
Edit: The expected value vector should look like:
value = c(10,20,20,40,10,5)
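To pin down the rule, here is a rough Python/pandas sketch of the intended behaviour (an illustration only; the R answers below do this natively):

import numpy as np
import pandas as pd

data = pd.DataFrame({
    "group": [1, 1, 1, 1, 1, 2],
    "dates": pd.to_datetime(["2005-01-01", "2006-05-01", "2007-05-01",
                             "2004-08-01", "2005-03-01", "2010-02-01"]),
    "value": [10, 20, np.nan, 40, np.nan, 5],
})

def fill_nearest(g):
    known = g.dropna(subset=["value"])
    na = g["value"].isna()
    # for each missing row, take the value at the known date closest in time
    idx = [(known["dates"] - d).abs().idxmin() for d in g.loc[na, "dates"]]
    g.loc[na, "value"] = known.loc[idx, "value"].values
    return g

print(data.groupby("group", group_keys=False).apply(fill_nearest)["value"].tolist())
# [10.0, 20.0, 20.0, 40.0, 10.0, 5.0]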
Using knn1 (nearest neighbour) from the class package (which comes with R, so you don't need to install it) and dplyr, define an na.knn1 function which replaces each NA value in x with the non-NA x value having the closest time.
library(class)
na.knn1 <- function(x, time) {
  is_na <- is.na(x)
  if (sum(is_na) == 0 || all(is_na)) return(x)
  train <- matrix(time[!is_na])  # times of the known values
  test <- matrix(time[is_na])    # times of the missing values
  cl <- x[!is_na]                # known values serve as class labels
  x[is_na] <- as.numeric(as.character(knn1(train, test, cl)))
  x
}
data %>% mutate(value = na.knn1(value, dates))
giving:
group dates value
1 1 2005-01-01 10
2 1 2006-05-01 20
3 1 2007-05-01 20
4 1 2004-08-01 40
5 1 2005-03-01 10
6 2 2010-02-01 5
Add an appropriate group_by if the intention was to do this by group.
You can try using sapply to find the closest values, since the x argument of which.closest only takes a single value.
First create a vector vect in which the dates whose values are missing are replaced with NA, and use it within the which.closest function.
library(birk)
vect=replace(data$dates,which(is.na(data$value)),NA)
transform(data,value=value[sapply(dates,which.closest,vec=vect)])
group dates value
1 1 2005-01-01 10
2 1 2006-05-01 20
3 1 2007-05-01 20
4 1 2004-08-01 40
5 1 2005-03-01 10
6 2 2010-02-01 5
If which.closest took a vector then there would be no need for sapply, but this is not the case.
Using the dplyr package:
library(birk)
library(dplyr)
data %>%
  mutate(vect = `is.na<-`(dates, is.na(value)),
         value = value[sapply(dates, which.closest, vect)]) %>%
  select(-vect)
MSSQL: I have this example data:
NAME AValue BValue
A 1 11
B 1 11
C 2 11
D 2 21
E 3 21
F 3 21
G 4 31
H 4 31
I 5 41
J 5 NULL
...
I am looking for an algorithm which finds all the Names linked through shared values under the different seeds (AValue and BValue; in this case the seed is 2 for AValue and 3 for BValue, but values can be skipped or assigned later and so on, so it is not just a matter of finding the smallest common multiple). In this case the output should be 1,2,3,4,11,21,31 as a first group/result. Then all the Names with these values can be updated, etc.
I need to find all the Names in a "closed circle" of values under the different seeds.
EDIT:
(attempt at a simpler example)
Imagine that you have a list of names, and each name is given two numbers. In most cases these numbers follow some seed (in this example each AValue appears twice, each BValue three times), but some numbers can be skipped, so you cannot just take the smallest common multiple of the seeds (here 2x3, i.e. every 6 names you would have a closed group where no Name shares an AValue or BValue with the next group). For example, Name A has 1 and 11. 1 is given to A and B, 11 to A, B, C. These Names have 1,2,11,21. So you check for 2 and 21 and then you get E and F in addition, and the checking loop should continue; once no more Names are added, the output should be 1,2,3,11,21. A "closed circle".
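The "closed circle" is effectively a connected component: link Names that share an AValue or a BValue and take the closure. A rough Python sketch of that closure (an illustration only, using the example data; the question still needs an MSSQL translation):

from collections import defaultdict

rows = [("A", 1, 11), ("B", 1, 11), ("C", 2, 11), ("D", 2, 21), ("E", 3, 21),
        ("F", 3, 21), ("G", 4, 31), ("H", 4, 31), ("I", 5, 41), ("J", 5, None)]

# index the Names by every value they carry
by_value = defaultdict(set)
for name, a, b in rows:
    by_value[("A", a)].add(name)
    if b is not None:
        by_value[("B", b)].add(name)

# breadth-first closure starting from Name "A"
values, names, frontier = set(), set(), {"A"}
while frontier:
    names |= frontier
    values |= {key for name, a, b in rows if name in frontier
               for key in ([("A", a)] + ([("B", b)] if b is not None else []))}
    frontier = set().union(*(by_value[v] for v in values)) - names
print(sorted(names))                 # ['A', 'B', 'C', 'D', 'E', 'F']
print(sorted(v for _, v in values))  # [1, 2, 3, 11, 21]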
I am trying to replicate a SQL sparse matrix multiplication using data tables. The SQL expression would be:
SELECT a.i, b.j, SUM(a.value*b.value)
FROM a, b
WHERE a.j = b.i
GROUP BY a.i, b.j;
where my data is structured as |i|j|value in each table
To create this data in R you can use:
library(reshape2)
library(data.table)
A <- matrix(runif(25),5,5)
B <- matrix(runif(25),5,5)
ADT <- data.table(melt(A))
BDT <- data.table(melt(B))
setnames(ADT,old = c("Var1","Var2","value"), new = c("Ai","Aj","AVal"))
setnames(BDT,old = c("Var1","Var2","value"), new = c("Bi","Bj","BVal"))
To merge using data.table's X[Y] join syntax, we need to set the keys that we join on:
setkey(ADT,"Aj")
setkey(BDT,"Bi")
To build it up piece by piece:
ADT[BDT, allow.cartesian = T]
Ai Aj AVal Bj BVal
1: 1 1 0.39230905 1 0.7083956
2: 2 1 0.89523490 1 0.7083956
3: 3 1 0.92464689 1 0.7083956
4: 4 1 0.15127499 1 0.7083956
5: 5 1 0.88838458 1 0.7083956
---
121: 1 5 0.70144360 5 0.7924433
122: 2 5 0.50409075 5 0.7924433
123: 3 5 0.15693879 5 0.7924433
124: 4 5 0.09164371 5 0.7924433
125: 5 5 0.63787487 5 0.7924433
So far so good. The merge worked properly; Bi has disappeared, but it is encoded by Aj anyway. We now want to multiply AVal by BVal, and then sum within the created groups (in the same expression! I know that I could store the result and apply a second expression here). I had thought this would be:
ADT[BDT, j = list(Ai, Bj, Value = sum(AVal*BVal)), by = c("Ai","Bj") , allow.cartesian = T]
but I get the error: 'Object Bj not found'. In fact, none of the columns from BDT are usable once I insert the by = clause (systematically remove Bj, BVal and "Bj" from the expression above, left to right, and you will see what I mean).
Looking into by = .EACHI, it seems the motivation there is to do what I want, but .EACHI groups on the merged index, not on a separate variable.
Sounds like you simply want to aggregate after the merge:
ADT[BDT, allow.cartesian = T][, sum(AVal * BVal), by = .(Ai, Bj)]
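For intuition, the same join-multiply-aggregate pattern in a Python/pandas sketch (a hypothetical cross-check; the data.table line above is the actual answer):

import numpy as np
import pandas as pd

A = np.random.rand(5, 5)
B = np.random.rand(5, 5)
ADT = pd.DataFrame([(i, j, A[i, j]) for i in range(5) for j in range(5)],
                   columns=["Ai", "Aj", "AVal"])
BDT = pd.DataFrame([(i, j, B[i, j]) for i in range(5) for j in range(5)],
                   columns=["Bi", "Bj", "BVal"])

# merge on the inner index, multiply, then sum within (Ai, Bj) groups
merged = ADT.merge(BDT, left_on="Aj", right_on="Bi")
C = (merged.assign(val=merged.AVal * merged.BVal)
           .groupby(["Ai", "Bj"])["val"].sum().unstack())
assert np.allclose(C.values, A @ B)   # matches the dense matrix product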
From my "Id" Column I want to remove the one and zero's from the left.
That is
1000003 becomes 3
1000005 becomes 5
1000011 becomes 11 and so on
Ignore -1, 10 and 1000000; they will be handled as special cases. But from the remaining rows I want to remove the leading "1" followed by zeros.
Well, you can use the modulus to get the end of the numbers (they will be the remainder). So just exclude the rows with Ids of [-1, 10, 1000000] and take the remaining Ids modulo 1000000:
print df
Id
0 -1
1 10
2 1000000
3 1000003
4 1000005
5 1000007
6 1000009
7 1000011
keep = df.Id.isin([-1,10,1000000])
df.Id[~keep] = df.Id[~keep] % 1000000
print df
Id
0 -1
1 10
2 1000000
3 3
4 5
5 7
6 9
7 11
Edit: Here is a fully vectorized string-slice version as an alternative (like Alex's method, but it takes advantage of pandas' vectorized string methods):
keep = df.Id.isin([-1,10,1000000])
df.Id[~keep] = df.Id[~keep].astype(str).str[1:].astype(int)
print df
Id
0 -1
1 10
2 1000000
3 3
4 5
5 7
6 9
7 11
Here is another way you could try to do it:
def f(x):
    """Convert the value to a string, then keep only the characters
    after the first one (the leading 1). For example, 100005 becomes
    00005; it may come back from the dataframe as 00005.0, which is
    why the float() is there. Then just convert it to an int, and
    you'll have 5, etc.
    """
    return int(float(str(x)[1:]))
# apply the function "f" to the dataframe and pass in the column 'Id'
df.apply(lambda row: f(row['Id']), axis=1)
I get that this question is satisfactorily answered. But for future visitors, what I like about Alex's answer is that it does not depend on there being a fixed number of zeros. The accepted answer will fail if you sometimes have 10005, sometimes 1000005, and so on.
However, to add something more to the way we think about it: if you know the prefix is always going to be 1000000, you can do
# back up all the values
foo = df.Id.copy()
# now some will be negative or zero
df.Id = df.Id - 1000000
# put back those that came out negative or zero (here, the first three rows)
df.Id[df.Id <= 0] = foo[df.Id <= 0]
It gives you the same result as Karl's answer, but I typically prefer this kind of method for its readability.