How to define incomplete sets in GAMS? - gams-math

I have an incomplete graph (e.g. with 5 vertices) whose adjacency matrix "a" is available. I want to define a set that contains every edge and no other pair of vertices. That is, a pair of vertices belongs to the set of edges iff the corresponding element of matrix "a" is positive.
The last line of the following code does not work!
sets i "Set of vertices" /1*5/ ;
alias(i,j);
set a(i,j) "Adjacency matrix" ;
Table a(i,j)
1 2 3 4 5
1 0 1 0 1 1
2 1 0 1 0 0
3 0 1 0 0 0
4 1 0 0 0 1
5 1 0 0 1 0;
Set edges(i,j);
edges(i,j) = a(i,j)$(a(i,j)>0);

If you want to have the edges, you must define a set and a parameter like this:
sets i "Set of vertices" /1*5/ ;
alias(i,j);
parameter a(i,j) "Adjacency matrix" ;
Table a(i,j)
1 2 3 4 5
1 0 1 0 1 1
2 1 0 1 0 0
3 0 1 0 0 0
4 1 0 0 0 1
5 1 0 0 1 0;
Set edges(i,j);
edges(i,j) $ a(i,j) =yes;

You can simplify your last line to
edges(i,j) = a(i,j);
This automatically acts as if you had written something like $(a<>0). However, since you have already declared your symbol a as a set and not as a parameter, I think you actually do not have to do anything: a already is what you are looking for. Just do
display a;
and look at the result in the lst file.
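For readers who want to check the expected result outside GAMS, here is a minimal Python sketch of the same rule (a pair is an edge iff its matrix entry is positive). The names a and edges simply mirror the GAMS symbols; this is only an illustration of the logic, not GAMS code.

```python
# Adjacency matrix from the question; vertices are labelled 1..5.
a = [
    [0, 1, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 1, 0],
]

# Keep the ordered pair (i, j) only when the matrix entry is positive,
# which is the condition the dollar operator expresses in the GAMS answers.
edges = {
    (i + 1, j + 1)
    for i, row in enumerate(a)
    for j, value in enumerate(row)
    if value > 0
}

print(sorted(edges))
```

Because the matrix is symmetric, each undirected edge shows up twice, once as (i, j) and once as (j, i), just as it would in the GAMS set.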

Related

Is there a function that does one-hot encoding and removes duplicates in R?

I have this table:
ID LABEL
1 A
1 B
2 B
3 c
I'm trying to do one-hot encoding, which I was able to do. However, I also need to collapse the duplicated IDs. Currently my one-hot encoded output looks like this:
ID A B C
1 1 0 0
1 0 1 0
2 0 1 0
3 0 0 1
and I need this to be the final table:
ID A B C
1 1 1 0
2 0 1 0
3 0 0 1
This is my code:
library(caret)  # provides dummyVars()
dummy <- dummyVars('~ .', data = data_to_be_encoded)
encoded_data <- data.frame(predict(dummy, newdata = data_to_be_encoded))
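The question asks about R, but the collapse step itself is language-independent: one-hot encode the label, then take the maximum of each indicator column per ID. A minimal pandas sketch of that idea follows, assuming the lowercase "c" in the input is meant to be "C" and using hypothetical variable names (data, one_hot, collapsed).

```python
import pandas as pd

# Input table from the question (label "c" assumed to mean "C").
data = pd.DataFrame({"ID": [1, 1, 2, 3], "LABEL": ["A", "B", "B", "C"]})

# One indicator column per label, one row per input row.
one_hot = pd.get_dummies(data["LABEL"], dtype=int)
one_hot.insert(0, "ID", data["ID"])

# Collapse duplicate IDs: an indicator is 1 if any row of that ID had the label.
collapsed = one_hot.groupby("ID", as_index=False).max()

print(collapsed)
```

The same pattern (group by ID, then take the maximum of each indicator column) should carry over to an R data frame, though the exact functions differ.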

How do I grab rows surrounding a flagged value?

I'm starting with a table like this:
code new_code_flag
abc123 0
xyz456 0
wer098 1
jio234 0
bcx190 0
eiw157 0
nzi123 0
epj676 0
ere654 0
yru493 1
ale674 0
I want to grab the 2 records before and 2 records after each value where "new_code_flag"=1. I want my output to look like this:
code new_code_flag
abc123 0
xyz456 0
wer098 1
jio234 0
bcx190 0
epj676 0
ere654 0
yru493 1
ale674 0
Any help on how to do this in SQL or SAS?
SQL tables represent unordered sets. Hence, in SQL you need to have a column that specifies the ordering. Assuming you do, you can do something like:
with t as (
      select t.*, row_number() over (order by ?) as seqnum
      from tbl t
     )
select t.*
from t
where exists (select 1
              from t t2
              where t2.new_code_flag = 1 and
                    t.seqnum between t2.seqnum - 2 and t2.seqnum + 2
             );
You could create two lag and two lead copies of the flag variable and then test if any of the 5 variables are 1 (true).
data have;
input code $ flag ;
cards;
abc123 0
xyz456 0
wer098 1
jio234 0
bcx190 0
eiw157 0
nzi123 0
epj676 0
ere654 0
yru493 1
ale674 0
;
data want ;
set have ;
set have(keep=flag rename=(flag=lead1_flag) firstobs=2) have(drop=_all_ obs=1);
set have(keep=flag rename=(flag=lead2_flag) firstobs=3) have(drop=_all_ obs=2);
lag1_flag=lag1(flag);
lag2_flag=lag2(flag);
if lag1_flag or lag2_flag or flag or lead1_flag or lead2_flag ;
run;
Results
Obs code flag lead1_flag lead2_flag lag1_flag lag2_flag
1 abc123 0 0 1 . .
2 xyz456 0 1 0 0 .
3 wer098 1 0 0 0 0
4 jio234 0 0 0 1 0
5 bcx190 0 0 0 0 1
6 epj676 0 0 1 0 0
7 ere654 0 1 0 0 0
8 yru493 1 0 . 0 0
9 ale674 0 . . 1 0
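The lag/lead idea above is not specific to SAS. As a rough cross-check, here is a pandas sketch that builds the same five shifted copies of the flag and keeps a row if any of them is 1. The frame name have and the column name flag follow the SAS example; near_flag and want are illustrative names, and the row order of the frame is assumed to be the intended ordering.

```python
import pandas as pd

have = pd.DataFrame({
    "code": ["abc123", "xyz456", "wer098", "jio234", "bcx190", "eiw157",
             "nzi123", "epj676", "ere654", "yru493", "ale674"],
    "flag": [0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
})

# shift(k) with k > 0 is a lag, k < 0 is a lead; k = 0 is the flag itself.
# Positions that fall off either end of the table are filled with 0.
shifted = [have["flag"].shift(k, fill_value=0) for k in range(-2, 3)]

# A row is kept when the flag, either lag, or either lead equals 1.
near_flag = pd.concat(shifted, axis=1).any(axis=1)
want = have[near_flag]

print(want)
```

This keeps the same nine rows as the SAS results table above.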
data want(drop=_: i);
merge have have(keep=flag firstobs=3 rename=(flag=_flag));
if flag or _flag then i=1;
if 0<i<=3 then do;
output;
i+1;
end;
else delete;
run;

Counting the number of times the value 1 appears in each row of a VCF converted to a pandas data frame

I am trying to count the number of times the value 1 appears in each row of a VCF converted into a data frame.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A01_01 A01_02 A01_03 A01_04 A01_05
chr01 27915 27915 T C . . . GT 0 1 0 0 1
chr01 28323 28323 G A . . . GT 0 1 0 0 1
chr01 28652 28652 G T . . . GT 0 1 0 0 1
chr01 29667 29667 C A . . . GT 0 1 0 0 1
chr01 30756 30756 C G . . . GT 0 1 0 0 1
chr01 31059 31059 G A . . . GT 0 1 0 0 1
chr01 31213 31213 G A . . . GT 0 1 0 0 1
chr01 31636 31636 T C . . . GT 0 1 0 0 1
chr01 31756 31756 C T . . . GT 0 1 0 0 1
chr01 31976 31976 C T . . . GT 0 1 0 0 1
This is what the VCF looks like in Excel, but with more rows and columns; the extra columns are just more genotypes and the extra rows are more positions and alleles.
I am trying to count them using a Python script. I have successfully converted the VCF into a pandas data frame using data = pd.read_table("....")
I know I should use the count function, but I am unable to get it to count in the rows that I want. The eventual goal is to make a histogram that shows the frequency of each allele (1 means it is present, 0 means it is not), so I want to count the number of times 1 appears in each row and make a histogram out of those counts. Any help would be appreciated.
There are two ways I know of to do this, both using the sum function in pandas. It allows you to take the sum of every numeric type cell in the row (so if you have, say, a column of ID strings, which it looks like you probably do, it'll skip those). If the only numerics in your data are 1s and 0s or you can easily remove any columns with other numerics, that'll do you.
I can't really parse your example data, so let's make some up for an example:
df = pd.DataFrame(np.random.randint(0,2, size=(100,4)), columns=list('ABCD'))
With this data, if you want to add an additional column that's the sum of each row:
df['Sum'] = df.sum(1, skipna=True, numeric_only=True)
Or you can just assign that to a variable itself. Either way you can then give those counts to your preferred plotting package to make your histogram.
If your data is more complex and you have numerics other than 1, you can take the intermediate step of creating a dataframe of booleans first, so if the value of a cell is 1 it's True and otherwise it's False. So let's make another random dataframe:
df2 = pd.DataFrame(np.random.randint(0,10, size=(100,4)), columns=list('ABCD'))
This one is random ints 0-9. Now let's make that intermediate dataframe:
df2_bool = (df2 == 1)
Now we can do that sums thing again:
df2['Sum'] = df2_bool.sum(1, skipna=True, numeric_only=True)
Now you've got counts!
There's likely a better way to do this but this is how I've been doing it and it's served me pretty well.
IIUC, you can do it this way:
In [45]: df.filter(like='A01').sum(axis=1)
Out[45]:
0 2
1 2
2 2
3 2
4 2
5 2
6 2
7 2
8 2
9 2
dtype: int64
In [44]: df.filter(like='A01')
Out[44]:
A01_01 A01_02 A01_03 A01_04 A01_05
0 0 1 0 0 1
1 0 1 0 0 1
2 0 1 0 0 1
3 0 1 0 0 1
4 0 1 0 0 1
5 0 1 0 0 1
6 0 1 0 0 1
7 0 1 0 0 1
8 0 1 0 0 1
9 0 1 0 0 1
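Putting the pieces together, here is a self-contained sketch that goes from VCF-like text to per-row counts and then to histogram counts. It assumes every column after FORMAT is a genotype column holding plain 0/1 values, and it reads a three-row excerpt of the question's data from a string instead of a file; the names vcf_text, genotypes, ones_per_row and histogram are illustrative.

```python
import io

import numpy as np
import pandas as pd

# Three rows of the sample data from the question, whitespace-separated.
vcf_text = """#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A01_01 A01_02 A01_03 A01_04 A01_05
chr01 27915 27915 T C . . . GT 0 1 0 0 1
chr01 28323 28323 G A . . . GT 0 1 0 0 1
chr01 28652 28652 G T . . . GT 0 1 0 0 1
"""
df = pd.read_csv(io.StringIO(vcf_text), sep=r"\s+")

# Assumption: FORMAT is the last fixed column, everything after it is a sample.
genotypes = df.iloc[:, df.columns.get_loc("FORMAT") + 1:]

# Comparing with 1 first means other numeric codes would not inflate the count.
ones_per_row = (genotypes == 1).sum(axis=1)

# histogram[k] is the number of rows in which the value 1 appears exactly k times.
histogram = np.bincount(ones_per_row.to_numpy(), minlength=genotypes.shape[1] + 1)

print(ones_per_row.tolist())
print(histogram.tolist())
```

The ones_per_row values can also be handed directly to a plotting library's histogram function if a figure is wanted.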

Calculating ratio value within a line which contain binary numbers "0" & "1"

I have a data file which contains more than 2000 lines and 45001 columns.
The first column is actually a string which describes the data type.
Starting from column 2, up to column 45001, the data is represented as "1" or "0".
For example, the pattern of data in a line is
(0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0)
The line holds 25 values. Within it there are 5 subgroups consisting only of "1"s (11, 111, 1111, 1, 111); the "0"s between the subgroups act as delimiters. The total number of "1"s is 13.
I would like to calculate the ratio of
(total of all "1"s / total of number of sub-groups made only by "1"s)
That is
(13/5).
I tried with this code for calculating the total of all "1"s ;
awk -F '0' '{print NF}' < inputfile.in
This gives value 13.
But I don't know how to go further from here to calculate the ratio that I want.
I don't know how to find the number of subgroups within each line because the occurrences of "1"s and "0"s are random.
I hope to get some help solving this problem.
Thanks in advance for any help.
It is not clear to me from the description what the format of the input file is. Assume the input looks like:
$ cat file
0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0
To count up the number of ones and the number of groups of ones and take their ratio:
$ awk '{f=0;s1=0;s2=0;for (i=2;i<=NF;i++){s1+=$i;if ($i && !f)s2++;f=$i}; print s1/s2}' file
2.6
Update: Handling all zeros
Suppose one of the lines in the file has all zeros:
$ cat file
0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
For the second line, both sums are zero, which would lead to a divide-by-zero error. We can avoid that by adding an if statement that prints the ratio if one exists, or 0/0 if it doesn't:
if (s2>0)print s1/s2; else print s1"/"s2
The complete code is now:
$ awk '{f=0;s1=0;s2=0;for (i=2;i<=NF;i++){s1+=$i;if ($i && !f)s2++;f=$i}; if (s2>0)print s1/s2; else print s1"/"s2}' file
2.6
0/0
How it works
The code uses three variables. f is a flag which is true (1) if we are currently in a group of ones and false (0) otherwise. s1 is the number of ones on the line. s2 is the number of groups of ones on the line.
f=0;s1=0;s2=0
At the beginning of each line, we initialize the variables.
for (i=2;i<=NF;i++){s1+=$i;if ($i && !f)s2++;f=$i}
We loop over each field on the line starting with field 2. If the field contains a 1, we increment counter s1. If the field is 1 and is the start of a new group, we increment s2.
if (s2>0)print s1/s2; else print s1"/"s2}
If we encountered at least one one, we print the ratio s1/s2. Otherwise, we print 0/0.
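The same counting logic can be sketched in Python for comparison. This version uses itertools.groupby to find the runs of ones rather than a flag variable, so it is a different technique from the awk loop; the function name ones_per_group and the choice to return None for an all-zero line are assumptions made for the example.

```python
from itertools import groupby


def ones_per_group(bits):
    """Return (number of ones) / (number of runs of ones), or None if there are no ones."""
    # groupby yields one group per maximal run of equal values; keep the runs of 1s.
    runs = [sum(1 for _ in run) for value, run in groupby(bits) if value == 1]
    if not runs:
        return None  # avoids dividing by zero on an all-zero line
    return sum(runs) / len(runs)


# First field is the label, as described in the question; the rest are 0/1 values.
line = "data 0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0"
bits = [int(field) for field in line.split()[1:]]

print(ones_per_group(bits))
```

For the sample line this gives the same 13/5 ratio as the awk one-liner.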
Here is an awk that does what you need:
cat file
data 0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0
data 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
data 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
data 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
BMR_10#O24-BMR_6#O13-H13 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1
data 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1
awk '{$1="";$0="0 "$0" 0";t=split($0,b,"1")-1;gsub(/ +/,"");n=split($0,a,"[^1]+")-2;print (n?t/n:0)}' file
2.6
0
25
11
5.5
3

Setting values in a matrix in bulk

The question is about bulk-changing values in a matrix based on data contained in a vector.
Suppose I have a 5x4 matrix of zeroes.
octave> Z = zeros(5,4)
Z =
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
And a column vector of length equal to the number of rows in Z, that is, 5. The rows in the vector y correspond to rows in the matrix Z.
octave> y = [1; 3; 2; 1; 3]
y =
1
3
2
1
3
What I want is to set 1's in the matrix Z in the columns whose indices are contained as values in the corresponding row of the vector y. Namely, I'd like to have Z matrix like this:
Z = # y =
1 0 0 0 # <-- 1 st column
0 0 1 0 # <-- 3 rd column
0 1 0 0 # <-- 2 nd column
1 0 0 0 # <-- 1 st column
0 0 1 0 # <-- 3 rd column
Is there a concise way of doing it? I know I can implement it using a loop over y, but I have a feeling Octave could have a more laconic way. I am new to Octave.
Since Octave has automatic broadcasting (you'll need Octave 3.6.0 or later), the easiest way I can think of is to use it with a comparison. Here's how:
octave> 1:5 == [1 3 2 1 3]'
ans =
1 0 0 0 0
0 0 1 0 0
0 1 0 0 0
1 0 0 0 0
0 0 1 0 0
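Broadcasting is explained in the Octave manual, and the SciPy documentation also has a good explanation of it with nice pictures. Note that comparing against 1:5 yields a 5x5 result; for the 5x4 matrix Z in the question, compare against 1:4 instead.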
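Since NumPy follows the same broadcasting rules, the comparison trick can be sketched there as well. This is only an illustration for readers coming from Python: the names n_cols and columns are made up for the example, and the width is set to 4 to match Z in the question.

```python
import numpy as np

y = np.array([1, 3, 2, 1, 3])
n_cols = 4  # width of Z in the question (assumed; the Octave example used 5)

# Row vector of 1-based column indices, compared against y as a column vector.
# Broadcasting expands both to shape (5, 4) before the element-wise comparison.
columns = np.arange(1, n_cols + 1)
Z = (columns == y[:, None]).astype(int)

print(Z)
```

Each row of Z then has a single 1 in the column named by the matching entry of y.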
Found another solution that does not use broadcasting. It does not need a matrix of zeroes either.
octave> y = [1; 3; 2; 1; 3]
octave> eye(5)(y,:)
ans =
1 0 0 0 0
0 0 1 0 0
0 1 0 0 0
1 0 0 0 0
0 0 1 0 0
Relevant reading here:
http://www.gnu.org/software/octave/doc/interpreter/Creating-Permutation-Matrices.html
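The identity-matrix indexing has a direct NumPy counterpart, shown here as a hedged sketch. NumPy indices are 0-based, so y is shifted by one, and the identity is sized to 4 to match the 5x4 matrix in the question (that size is an assumption; the Octave example used 5).

```python
import numpy as np

y = np.array([1, 3, 2, 1, 3])

# Row k of the identity matrix is the one-hot vector for column k,
# so indexing with y - 1 picks one one-hot row per entry of y.
Z = np.eye(4, dtype=int)[y - 1]

print(Z)
```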