Reading from input with double trailing @@ - input

Input:
G0894 x 1 x 3 x 1 k 1
C4458 x 1 k 5
C9057 x 7 x 4 x 4 x 3 x 5
Desired output:
G0894 x 1
G0894 x 3
G0894 x 1
G0894 k 1
C4458 x 1
C4458 k 5
C9057 x 7
C9057 x 4
C9057 x 4
C9057 x 3
C9057 x 5
This is what I came up with:
data want;
  infile cards missover;
  input id $ @;
  do while (1);
    input letter $ number @;
    if letter EQ ' ' then leave;
    output;
  end;
cards;
G0894 x 1 x 3 x 1 k 1
C4458 x 1 k 5
C9057 x 7 x 4 x 4 x 3 x 5
;
run;
And it does work, but since we've been talking about the double trailing @@ in class, I think I'm supposed to use it. This was my other approach:
data want;
  infile cards missover;
  input id $ @;
  input letter $ number @@;
cards;
G0894 x 1 x 3 x 1 k 1
C4458 x 1 k 5
C9057 x 7 x 4 x 4 x 3 x 5
;
run;
And it generates an error which says something about using MISSOVER and @@ in an inconsistent manner. What am I doing wrong?

There is no way in your program for the data step to ever advance to the second row of input data. That is what the error message is telling you.
The @@ tells SAS to keep the line pointer and column pointer where they are when it starts the next data step iteration. The MISSOVER option tells SAS not to go to a new line when it cannot find data to satisfy the current input request. Hence there is no way for the line pointer to ever advance to line two.

The double trailing at sign (@@) holds a record across multiple iterations of the DATA step until the end of the record is reached. The single trailing at sign (@), by contrast, holds the record only within the current iteration and releases it when control returns to the top of the DATA step.
Try this:
data want;
  infile cards missover;
  input id $ letter $ number @;
  do while (letter ne '' or number ne .);
    output;
    input letter $ number @;
  end;
cards;
G0894 x 1 x 3 x 1 k 1
C4458 x 1 k 5
C9057 x 7 x 4 x 4 x 3 x 5
;
run;
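If you want to check the expected output outside SAS, the same "read the id once, then keep pulling letter/number pairs from the held line" logic can be sketched in Python. This only illustrates the reshaping, not how SAS handles its line pointer, and `wide_to_long` is a made-up name:

```python
def wide_to_long(lines):
    """Turn 'id letter number letter number ...' lines into one row per pair."""
    rows = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        record_id, rest = tokens[0], tokens[1:]
        # Walk the remaining tokens two at a time: (letter, number).
        for letter, number in zip(rest[0::2], rest[1::2]):
            rows.append((record_id, letter, int(number)))
    return rows


data = [
    "G0894 x 1 x 3 x 1 k 1",
    "C4458 x 1 k 5",
    "C9057 x 7 x 4 x 4 x 3 x 5",
]
rows = wide_to_long(data)
for record_id, letter, number in rows:
    print(record_id, letter, number)
```

Each input line produces as many rows as it has letter/number pairs, which is what the do-loop with the single trailing @ does in the data step.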

How to create 2 datalines in sas with different length

I want to create a table like that:
a 1 2 3
b 1 2 3 4
a has 3 values, b has 4.
How can I do it in SAS?
When I enter it like that it deletes the 4 at the end.
data my_data;
input a b;
datalines;
1 1
2 2
3 3
4
I am very new to SAS. Thanks for your advice.
If you want to use LIST MODE input, like in your example, then each variable needs to have a "word" on the line. Use a period to indicate the missing values.
data my_data;
input a b;
datalines;
1 1
2 2
3 3
. 4
;
Otherwise switch to COLUMN MODE input. Here the 4 on the last line has to sit in columns 3-4, so it is preceded by two spaces.
data my_data;
input a 1-2 b 3-4 ;
datalines;
1 1
2 2
3 3
  4
;
Or FORMATTED MODE (again with the 4 in columns 3-4):
data my_data;
input a 2. b 2.;
datalines;
1 1
2 2
3 3
  4
;
Note that you can use the period to indicate a missing value even when the variable is character. This is because the normal character informat will convert that single period into a blank value.
data my_data;
input a $ b;
datalines;
1 1
2 2
3 3
. 4
;
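To see why the period is needed in list mode but not in column mode, here is a rough Python sketch of the two ways of reading a line. It is a simplification, not SAS itself (no line flow-over, no informats), and `parse_list_mode` and `parse_column_mode` are hypothetical names:

```python
def parse_list_mode(line):
    """Whitespace-delimited fields; a lone '.' stands for a missing value."""
    return [None if tok == "." else int(tok) for tok in line.split()]


def parse_column_mode(line, spans=((0, 2), (2, 4))):
    """Fixed column ranges (0-based, end-exclusive); a blank range is missing."""
    padded = line.ljust(max(end for _, end in spans))
    values = []
    for start, end in spans:
        text = padded[start:end].strip()
        values.append(int(text) if text else None)
    return values


print(parse_list_mode(". 4"))    # the period marks the missing a
print(parse_list_mode("4"))      # only one word, so nothing is left for b
print(parse_column_mode("  4"))  # blank columns 1-2 make a missing
print(parse_column_mode("1 1"))
```

In list mode the position of a value is decided by counting words, so a placeholder is required; in column mode it is decided by where the characters sit on the line.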

How to group rows in sql with particular order in bunch of sets

Let's say I have a table with 3 columns:
one two three
x LF 1
x FI 2
x LF 3
x FI 4
x FI 5
x FI 6
x LF 7
x FI 8
x LF 9
x FI 10
x LF 11
x LF 12
x LF 13
x LF 14
x FI 15
Now what I want is to group consecutive rows with the same 'two' value and take the lowest 'three' value from each group, which will give me output like:
one two three
x LF 1
x FI 2
x LF 3
x FI 4
x LF 7
x FI 8
x LF 9
x FI 10
x LF 11
x FI 15
How can I achieve this?
You seem to want the rows where there is a change. You can use lag():
select one, two, three
from (select t.*,
             lag(two) over (order by three) as prev_two
      from t
     ) t
where prev_two is null or prev_two <> two;
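If it helps to see what the lag() filter does step by step, here is a small Python sketch of the same idea: sort by three, remember the previous row's two, and keep a row only when that value changes. The function name is made up, and it assumes two is never NULL (None is used as the "no previous row" marker):

```python
def rows_where_value_changes(rows):
    """rows: iterable of (one, two, three); keep rows that start a new run of `two`."""
    kept = []
    prev_two = None  # plays the role of lag(two) being NULL on the first row
    for row in sorted(rows, key=lambda r: r[2]):  # order by three
        if prev_two is None or row[1] != prev_two:
            kept.append(row)
        prev_two = row[1]
    return kept


twos = ["LF", "FI", "LF", "FI", "FI", "FI", "LF", "FI",
        "LF", "FI", "LF", "LF", "LF", "LF", "FI"]
table = [("x", two, i) for i, two in enumerate(twos, start=1)]
result = rows_where_value_changes(table)
for row in result:
    print(*row)
```

Because the first row of each run is also the one with the lowest three, this matches the "lowest value per group" output in the question.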

Why I do not get the right answer

I think logically the following code is right but I get the wrong answer:
.mod file:
set R := {1,2};
set D1 := {1,2,4,5};
set P1 := {1,2,3,4,5};
var V{D1,R}, binary;
param Ud{D1,R} ;
param U{P1,R} ;
minimize obj{p in D1, r in R}: V[p,r] * (Ud[p,r]+ sum{j in P1: j!=p} U[j,r]);
s.t. a10{ r in R }: sum{p in D1} V[p,r]=2 ;
.dat file:
param Ud: 1 2:=
1 -10 -6
2 -20 -4
4 1 -10
5 -4 -4;
param U: 1 2 :=
1 -8.1 -3
2 -6.8 -8
3 -7.2 1
4 -16 -4
5 -6.8 -4;
Basically, for each r and for two values of p, I want to minimize (Ud[p,r] + sum{j in P1: j!=p} U[j,r]).
But it always gives me V[1,r]=V[5,r]=1, even if V[2,r] minimizes the objective function.
I expect to get V[2,r]=1 because -20 + (-8.1-7.2-16-6.8) is the most negative.
Your syntax for the objective function is incorrect; it should be
minimize obj: sum {p in D1, r in R} V[p,r] * (Ud[p,r]+ sum{j in P1: j != p} U[j,r]);
(Note the location of the colon (:), and the presence of the sum.) To be honest I'm not exactly sure what AMPL was doing in response to your objective function, but I would just treat the results as unpredictable.
With the revised objective function, the optimal solution is:
ampl: display V;
V :=
1 1 1
1 2 1
2 1 1
2 2 0
4 1 0
4 2 1
5 1 0
5 2 0
;
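As a sanity check on that result, here is a small brute-force sketch in Python (not AMPL). It assumes the data from the .dat file above and relies on the fact that the constraint is per r, so each r can be solved on its own by trying every pair of p values. The names `coefficient` and `best_pair` are made up for this sketch:

```python
from itertools import combinations

R = [1, 2]
D1 = [1, 2, 4, 5]
P1 = [1, 2, 3, 4, 5]
# Ud[p][r] and U[j][r], copied from the .dat file.
Ud = {1: {1: -10, 2: -6}, 2: {1: -20, 2: -4},
      4: {1: 1, 2: -10}, 5: {1: -4, 2: -4}}
U = {1: {1: -8.1, 2: -3}, 2: {1: -6.8, 2: -8}, 3: {1: -7.2, 2: 1},
     4: {1: -16, 2: -4}, 5: {1: -6.8, 2: -4}}


def coefficient(p, r):
    """Objective coefficient of V[p,r]."""
    return Ud[p][r] + sum(U[j][r] for j in P1 if j != p)


def best_pair(r):
    """Cheapest way to set exactly two V[p,r] to 1 for this r."""
    return min(combinations(D1, 2),
               key=lambda pair: sum(coefficient(p, r) for p in pair))


chosen = {r: set(best_pair(r)) for r in R}
print(chosen)
```

For r=1 the two most negative coefficients belong to p=2 and p=1, and for r=2 to p=4 and p=1, which agrees with the V values displayed above.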

Remove patterns in a tab file

Hi everyone,
I have a file such as:
scaffold_1_1 X 2 2
scaffold_24_0 X 9 2
scaffold_15 X 2 2
IDBA_scaffold_30_1 X 2 317
scf7180005161000_2 X 1 2
And the idea is simply to remove the last number part of all names in the first column,
but there are 4 types of scaffold names:
scaffold_number0_number1
scaffold_number0
IDBA_scaffold_number0_number1
scfXXX_number1
and the idea is to remove the trailing _number1 part wherever it is present. Here is the result I should get in this example:
scaffold_1 X 2 2
scaffold_24 X 9 2
scaffold_15 X 2 2
IDBA_scaffold_30 X 2 317
scf7180005161000 X 1 2
Do you have any idea how to deal with that?
Thank you for your help.
1st solution: Could you please try the following. (It simply removes the last _ and the digits following it from the first field, so use it only if that is all you need.)
awk '{sub(/_[0-9]+$/,"",$1)} 1' Input_file
2nd solution:
In case you want to strip the suffix only when the 1st field contains the string scaffold and has more than 2 _-separated parts (or when the line contains scf), then try the following.
awk '(/scaffold/ && num=split($1,a,"_")>2) || /scf/{sub(/_[0-9]+$/,"",$1)} 1' Input_file
Output will be as follows.
scaffold_1 X 2 2
scaffold_24 X 9 2
scaffold_15 X 2 2
IDBA_scaffold_30 X 2 317
scf7180005161000 X 1 2
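For anyone who prefers to see the rule of the 2nd solution spelled out, here is a rough Python equivalent. It is only a sketch: unlike the awk command it tests the first field rather than the whole line, and `strip_suffix` and `clean_line` are made-up names:

```python
import re

SUFFIX = re.compile(r"_[0-9]+$")  # last underscore plus trailing digits


def strip_suffix(name):
    """Drop the trailing _number1 from scaffold names with >2 parts and from scf names."""
    has_extra_part = "scaffold" in name and len(name.split("_")) > 2
    if has_extra_part or "scf" in name:
        return SUFFIX.sub("", name)
    return name


def clean_line(line):
    """Apply strip_suffix to the first whitespace-separated field only."""
    name, rest = line.split(None, 1)
    return f"{strip_suffix(name)} {rest}"


sample = [
    "scaffold_1_1 X 2 2",
    "scaffold_24_0 X 9 2",
    "scaffold_15 X 2 2",
    "IDBA_scaffold_30_1 X 2 317",
    "scf7180005161000_2 X 1 2",
]
cleaned = [clean_line(line) for line in sample]
print("\n".join(cleaned))
```

The part count is what protects scaffold_15: it has only two _-separated parts, so its number is treated as number0 and kept.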
You can try Perl,
perl -pe ' s/(^\S+)_\d\b/$1/g '
with your inputs
$ cat bean.txt
scaffold_1_1 X 2 2
scaffold_24_0 X 9 2
scaffold_15 X 2 2
IDBA_scaffold_30_1 X 2 317
scf7180005161000_2 X 1 2
$ perl -pe ' s/(^\S+)_\d\b/$1/g ' bean.txt
scaffold_1 X 2 2
scaffold_24 X 9 2
scaffold_15 X 2 2
IDBA_scaffold_30 X 2 317
scf7180005161000 X 1 2
$
Thanks @anubhava for catching one of the edge cases and helping to fix it.
$ cat bean2.txt
scaffold_1_1 X 2 2
scaffold_24_0 X 9 2
scaffold_15 X 2 2
IDBA_scaffold_30_1 X 2 317
scaffold_1_15 X 2 2 # => this was not fixed in first answer
$ perl -pe 's/^(?!scaffold_\d+\b)(\S+)_\d+\b/$1/g' bean2.txt
scaffold_1 X 2 2
scaffold_24 X 9 2
scaffold_15 X 2 2
IDBA_scaffold_30 X 2 317
scaffold_1 X 2 2
$
Here is another awk variant:
awk 'BEGIN{FS=OFS="\t"} $1 ~ /^scf[0-9]+_[0-9]+$/ || split($1, a, "_") > 2 {
sub(/_[0-9]+$/, "", $1) } 1' file
scaffold_1 X 2 2
scaffold_24 X 9 2
scaffold_15 X 2 2
IDBA_scaffold_30 X 2 317
scf7180005161000 X 1 2
Using any sed that supports -E for EREs, e.g. GNU or OSX/BSD seds:
$ sed -E 's/((_|scf)[0-9]+)_[0-9]+/\1/' file
scaffold_1 X 2 2
scaffold_24 X 9 2
scaffold_15 X 2 2
IDBA_scaffold_30 X 2 317
scf7180005161000 X 1 2

Clubbing two lines into one using AWK

I have a file with following format
Time Number Val
x 1 y
x 1 y
a 1 z
b 1 m
b 2 m
I want to club (merge) lines that have the same Time value, adding up their Number values. The final file should look something like this:
Time Number Val
x 2 y
a 1 z
b 3 m
How to do this using awk?
You can use awk's associative array:
awk 'NR==1{print $0} NR!=1{a[$1]+=$2; b[$1]=$3;} \
END{ for ( i in a) print i, a[i], b[i]}' file
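For your sample input, it prints the following (the order of the rows after the header may differ, since awk does not guarantee the iteration order of for (i in a)):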
Time Number Val
x 2 y
a 1 z
b 3 m
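The associative-array idea is the same as summing into a dictionary keyed by the first column. Here is a minimal Python sketch of it, with a made-up function name `club`; unlike awk, a Python dict keeps insertion order, so the rows come out in first-seen order:

```python
def club(lines):
    """Sum Number per Time; keep the last Val seen for each Time."""
    header, *body = lines
    totals = {}  # Time -> summed Number (like a[$1] += $2)
    vals = {}    # Time -> Val           (like b[$1] = $3)
    for line in body:
        time, number, val = line.split()
        totals[time] = totals.get(time, 0) + int(number)
        vals[time] = val
    return [header] + [f"{t} {totals[t]} {vals[t]}" for t in totals]


source = [
    "Time Number Val",
    "x 1 y",
    "x 1 y",
    "a 1 z",
    "b 1 m",
    "b 2 m",
]
clubbed = club(source)
print("\n".join(clubbed))
```

Note that, as in the awk version, rows with the same Time but different Val would be merged and only the last Val kept.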
To sum Number over each distinct Time and Val combination instead:
awk 'NR>1{a[$1,$3]+=$2;next}$1=$1;END{for(k in a){split(k,s,SUBSEP);print s[1],a[k],s[2]}}' OFS="\t" file
Time Number Val
a 1 z
b 3 m
x 2 y