Find and replace and move a line that contains a specific string - awk

Assuming I have the following text file:
a b c d 1 2 3
e f g h 1 2 3
i j k l 1 2 3
m n o p 1 2 3
How do I replace '1 2 3' with '4 5 6' in the line that contains the letter (e), and move that line so it comes after the line that contains the letter (k)?
N.B. the line that contains the letter (k) may appear anywhere in the file; the lines are not assumed to be in any particular order
My approach is:
1. Remove the line I want to replace
2. Find the lines before the line I want to move it after
3. Find the lines after the line I want to move it after
4. Append the output to a file
grep -v 'e' $original > $file
grep -B999 'k' $file > $output
grep 'e' $original | sed 's/1 2 3/4 5 6/' >> $output
grep -A999 'k' $file | tail -n+2 >> $output
rm $file
mv $output $original
but there are a lot of issues with this solution:
a lot of grep commands that seem unnecessary
the arguments -A999 and -B999 assume the file does not contain more than 999 lines; it would be better to have another way to get the lines before and after the matched line
I am looking for a more efficient way to achieve this

Using sed
$ sed '/e/{s/1 2 3/4 5 6/;h;d};/k/{G}' input_file
a b c d 1 2 3
i j k l 1 2 3
e f g h 4 5 6
m n o p 1 2 3
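For readability, the same GNU sed program can be spelled out across multiple lines with comments. This is just a sketch; it assumes the sample data above has been saved as input_file:

```shell
# Build the sample input (assumed filename: input_file)
printf '%s\n' 'a b c d 1 2 3' 'e f g h 1 2 3' 'i j k l 1 2 3' 'm n o p 1 2 3' > input_file

sed '
  /e/{
    # edit the e line
    s/1 2 3/4 5 6/
    # stash it in the hold space and drop it from the output
    h
    d
  }
  # on the k line, append the held line after it
  /k/G
' input_file
```

Note this relies on the e line appearing before the k line: the hold space must already be filled when G runs, otherwise G appends an empty hold space (a blank line) after k.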

Here is a GNU awk solution:
awk '
/\<e\>/ {
    s = $0
    sub("1 2 3", "4 5 6", s)
    next
}
/\<k\>/ && s {
    printf("%s\n%s\n", $0, s)
    next
}
1
' file
Or POSIX awk:
awk '
function has(x) {
    for (i=1; i<=NF; i++) if ($i==x) return 1
    return 0
}
has("e") {
    s = $0
    sub("1 2 3", "4 5 6", s)
    next
}
has("k") && s {
    printf("%s\n%s\n", $0, s)
    next
}
1
' file
Either prints:
a b c d 1 2 3
i j k l 1 2 3
e f g h 4 5 6
m n o p 1 2 3
This works regardless of the order of e and k in the file:
awk '
function has(x) {
    for (i=1; i<=NF; i++) if ($i==x) return 1
    return 0
}
has("e") {
    s = $0
    sub("1 2 3", "4 5 6", s)
    next
}
FNR<NR && has("k") && s {
    printf("%s\n%s\n", $0, s)
    s = ""
    next
}
FNR<NR
' file file

This awk should work for you:
awk '
/(^| )e( |$)/ {
    sub(/1 2 3/, "4 5 6")
    p = $0
    next
}
1
/(^| )k( |$)/ && p {
    print p
    p = ""
}' file
a b c d 1 2 3
i j k l 1 2 3
e f g h 4 5 6
m n o p 1 2 3

This might work for you (GNU sed):
sed -n '/e/{s/1 2 3/4 5 6/;s#.*#/e/d;/k/s/.*/\&\\n&/#p};' file | sed -f - file
Design a sed script by passing the file twice and applying the sed instructions from the first pass to the second.
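To see what that first pass generates, you can run it on its own (a sketch, assuming the sample data from the question is saved as file):

```shell
# Build the sample input (assumed filename: file)
printf '%s\n' 'a b c d 1 2 3' 'e f g h 1 2 3' 'i j k l 1 2 3' 'm n o p 1 2 3' > file

# First pass: turn the matched e line into a sed script...
sed -n '/e/{s/1 2 3/4 5 6/;s#.*#/e/d;/k/s/.*/\&\\n&/#p};' file
# → /e/d;/k/s/.*/&\ne f g h 4 5 6/
# ...which deletes the original e line and re-inserts the edited
# copy after the k line when the second pass applies it to file.
```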
Another solution is to use ed:
cat <<\! | ed file
/e/s/1 2 3/4 5 6/
/e/m/k/
wq
!
Or if you prefer:
<<<$'/e/s/1 2 3/4 5 6/\n.m/k/\nwq' ed -s file
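A plain printf pipeline also works if here-docs and here-strings are inconvenient (a sketch, assuming the sample data is in file; -s suppresses ed's byte-count output):

```shell
# Build the sample input (assumed filename: file)
printf '%s\n' 'a b c d 1 2 3' 'e f g h 1 2 3' 'i j k l 1 2 3' 'm n o p 1 2 3' > file

# Feed ed the same three commands, one per line
printf '%s\n' '/e/s/1 2 3/4 5 6/' '/e/m/k/' 'wq' | ed -s file
cat file
```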

Looping through combinations of selected strings in specific columns and counting their occurrence

I have
A 34 missense fixed
A 33 synonymous fixed
B 12 synonymous var
B 34 missense fixed
B 34 UTR fixed
B 45 missense var
TRI 4 synonymous var
TRI 4 intronic var
3 3 synonymous fixed
I want to output the counts of the combinations missense && fixed, missense && var, synonymous && fixed, and synonymous && var, for each element in $1
missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 0
TRI 0 0 0 1
3 0 0 1 0
I can do it this way with 4 individual commands, selecting each combination and concatenating the outputs
awk -F'\t' '($3~/missense/ && $4~/fixed/)' file | awk -F'\t' '{count[$1"\t"$3"\t"$4]++} END {for (word in count) print word"\t"count[word]}' > out
But I would like to do this for all combinations at once. I've tried some variations of this but was not able to make it work
awk print a[i] -v delim=":" -v string='missense:synonymous:fixed:var' 'BEGIN {n = split(string, a, delim); for (i = 1; i <= n-2; ++i) {count[xxxx}++}} END ;for (word in count) print word"\t"count[word]}
You may use this awk with multiple arrays to hold different counts:
awk -v OFS='\t' '
{ keys[$1] }
/missense fixed/   { ++mf[$1] }
/missense var/     { ++mv[$1] }
/synonymous fixed/ { ++sf[$1] }
/synonymous var/   { ++sv[$1] }
END {
    print "-\tmissensefixed\tmissensevar\tsynonymousfixed\tsynonymousvar"
    for (i in keys)
        print i, mf[i]+0, mv[i]+0, sf[i]+0, sv[i]+0
}
' file | column -t
- missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 1
TRI 0 0 0 1
3 0 0 1 0
I have used column -t for tabular output only.
GNU awk supports arrays of arrays, so if it is your awk you can count your records with something as simple as num[$1][$3$4]++. The most complex part is the final human-friendly printing:
$ cat foo.awk
{ num[$1][$3$4]++ }
END {
    printf(" missensefixed missensevar synonymousfixed synonymousvar\n");
    for (r in num)
        printf("%3s%14d%12d%16d%14d\n", r, num[r]["missensefixed"],
               num[r]["missensevar"], num[r]["synonymousfixed"], num[r]["synonymousvar"])
}
$ awk -f foo.awk data.txt
missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 1
TRI 0 0 0 1
3 0 0 1 0
Using any awk in any shell on every Unix box with an assist from column to convert the tab-separated awk output to a visually tabular display if you want it:
$ cat tst.awk
BEGIN {
    OFS = "\t"
    numTags = split("missensefixed missensevar synonymousfixed synonymousvar",tags)
}
{
    keys[$1]
    cnt[$1,$3 $4]++
}
END {
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tags[tagNr]
        printf "%s%s", OFS, tag
    }
    print ""
    for (key in keys) {
        printf "%s", key
        for (tagNr=1; tagNr<=numTags; tagNr++) {
            tag = tags[tagNr]
            val = cnt[key,tag]
            printf "%s%d", OFS, val
        }
        print ""
    }
}
$ awk -f tst.awk file
missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 1
TRI 0 0 0 1
3 0 0 1 0
$ awk -f tst.awk file | column -s$'\t' -t
missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 1
TRI 0 0 0 1
3 0 0 1 0
I'd highly recommend you always give every column a header string, though, so it doesn't make further processing of the data harder (e.g. reading it into Excel and sorting on headers). So if I were you I'd add printf "key" (or something else that more accurately identifies that column's contents) as the first line of the END section, i.e. on a line immediately before the first for loop, so the first column gets a header too:
$ awk -f tst.awk file | column -s$'\t' -t
key missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 1
TRI 0 0 0 1
3 0 0 1 0
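A sketch of tst.awk with that extra first-column header in place (the header string "key" is just an example name, and the sample data from the question is assumed to be in file):

```shell
# Build the sample input from the question (assumed filename: file)
printf '%s\n' 'A 34 missense fixed' 'A 33 synonymous fixed' \
    'B 12 synonymous var' 'B 34 missense fixed' 'B 34 UTR fixed' \
    'B 45 missense var' 'TRI 4 synonymous var' 'TRI 4 intronic var' \
    '3 3 synonymous fixed' > file

cat > tst.awk <<'EOF'
BEGIN {
    OFS = "\t"
    numTags = split("missensefixed missensevar synonymousfixed synonymousvar",tags)
}
{
    keys[$1]
    cnt[$1,$3 $4]++
}
END {
    # the added first-column header
    printf "key"
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        printf "%s%s", OFS, tags[tagNr]
    }
    print ""
    for (key in keys) {
        printf "%s", key
        for (tagNr=1; tagNr<=numTags; tagNr++) {
            printf "%s%d", OFS, cnt[key,tags[tagNr]]
        }
        print ""
    }
}
EOF

awk -f tst.awk file
```

Pipe through column -s$'\t' -t as above if you want the aligned display.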

How to print out lines starting with keyword and connected with backslash with sed or awk

For example, I'd like to print out the lines starting with set_2 and connected with \ like this. I'd like to know whether it's possible to do it with sed, awk or any other text-processing command-line tools.
< Before >
set_1 abc def
set_2 a b c d\
e f g h\
i j k l\
m n o p
set_3 ghi jek
set_2 aaa bbb\
ccc ddd\
eee fff
set_4 1 2 3 4
< After text process >
set_2 a b c d\
e f g h\
i j k l\
m n o p
set_2 aaa bbb\
ccc ddd\
eee fff
Try the following:
awk -v st="set_2" '/^set/ {set=$1} /\\$/ && set==st { prnt=1 } prnt==1 { print } !/\\$/ { prnt=0 }' file
Explanation:
awk -v st="set_2" '   # Pass the set to track as a variable st
/^set/ {
    set=$1            # When the line begins with "set", track the set in the variable set
}
/\\$/ && set==st {
    prnt=1            # When we are in the required set block and the line ends with "\", set a print marker (prnt) to 1
}
prnt==1 {
    print             # When the print marker is 1, print the line
}
!/\\$/ {
    prnt=0            # If the line does not end with "\", set the print marker to 0
}' file
Would you please try the following sed solution:
sed -nE '
/^set_2/ { ;# if the line starts with "set_2" execute the block
:a ;# define a label "a"
/\\[[:space:]]*$/! {p; bb} ;# if the line does not end with "\", print the pattern space and exit the block
N ;# append the next line to the pattern space
ba ;# go to label "a"
} ;# end of the block
:b ;# define a label "b"
' file
Please note the character class [[:space:]]* is inserted just because the OP's posted example contains whitespace after the backslash.
[Alternative]
If perl is your option, following will also work:
perl -ne 'print if /^set_2/..!/\\\s*$/' file
This simple awk command should do the job:
awk '!/^[[:blank:]]/ {p = ($1 == "set_2")} p' file
set_2 a b c d\
e f g h\
i j k l\
m n o p
set_2 aaa bbb\
ccc ddd\
eee fff
And with this awk:
awk -F'[[:blank:]]*' '$1 == "set_2" || $NF ~ /\\$/ {print $0;f=1} f && $1 == ""' file
set_2 a b c d\
e f g h\
i j k l\
m n o p
set_2 aaa bbb\
ccc ddd\
eee fff
This might work for you (GNU sed):
sed ':a;/set_2/{:b;n;/set_/ba;bb};d' file
If a line contains set_2, print it and go on printing until another line containing set_, then repeat the first test.
Otherwise delete the line.
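Spread over several lines with comments, the same script reads (a sketch, assuming the sample data from the question is in file):

```shell
# Build the sample input (assumed filename: file)
printf '%s\n' 'set_1 abc def' 'set_2 a b c d\' 'e f g h\' 'i j k l\' \
    'm n o p' 'set_3 ghi jek' 'set_2 aaa bbb\' 'ccc ddd\' 'eee fff' \
    'set_4 1 2 3 4' > file

sed '
  :a
  /set_2/{
    :b
    # print the current line and fetch the next one
    n
    # a new set_ header: re-test it from the top
    /set_/ba
    # otherwise it is a continuation line: keep printing
    bb
  }
  # any other line is deleted
  d
' file
```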

add filename without the extension at certain columns using awk

I would like to leave the first four columns empty, then add the filename without its extension in the last 4 columns. I have files named like file.frq and so on. Later I will apply this to the 200 files in a loop.
input
CHR POS REF ALT AF HOM Het Number of animals
1 94980034 C T 0 0 0 5
1 94980057 C T 0 0 0 5
Desired output
file file file file
CHR POS REF ALT AF HOM Het Number of animals
1 94980034 C T 0 0 0 5
1 94980057 C T 0 0 0 5
I tried this from Add file name and empty column to existing file in awk
awk '{$0=(NR==1? " \t"" \t"" \t"" \t":FILENAME"\t") "\t" $0}7' file2.frq
But it gave me this:
CHR POS REF ALT AF HOM Het Number of animals
file2.frq 1 94980034 C T 0 0 0 5
file2.frq 1 94980057 C T 0 0 0 5
file2.frq 1 94980062 G C 0 0 0 5
and I also tried this
awk -v OFS="\t" '{print FILENAME, $1=" ",$2=" ",$3=" ", $4=" ",$5 - end}' file2.frq
but it gave me this
CHR POS REF ALT AF HOM Het Number of animals
file2.frq 1 94980034 C T 0 0 0 5
file2.frq 1 94980057 C T 0 0 0 5
any help will be appreciated!
Assuming your input is tab-separated like your desired output:
awk '
BEGIN { FS=OFS="\t" }
NR==1 {
    orig = $0
    fname = FILENAME
    sub(/\.[^.]*$/,"",fname)
    $1=$2=$3=$4 = ""
    $5=$6=$7=$8 = fname
    print
    $0 = orig
}
1' file.txt
file file file file
CHR POS REF ALT AF HOM Het Number of animals
1 94980034 C T 0 0 0 5
1 94980057 C T 0 0 0 5
To see it in table format:
$ awk '
BEGIN { FS=OFS="\t" }
NR==1 {
    orig = $0
    fname = FILENAME
    sub(/\.[^.]*$/,"",fname)
    $1=$2=$3=$4 = ""
    $5=$6=$7=$8 = fname
    print
    $0 = orig
}
1' file.txt | column -s$'\t' -t
file file file file
CHR POS REF ALT AF HOM Het Number of animals
1 94980034 C T 0 0 0 5
1 94980057 C T 0 0 0 5

Counter in awk if-else loop

Can you explain to me why this simple one-liner does not work? Thanks for your time.
awk 'BEGIN{i=1}{if($2 == i){print $0} else{print "0",i} i=i+1}' check
input text file with name "check":
a 1
b 2
c 3
e 5
f 6
g 7
desired output:
a 1
b 2
c 3
0 4
e 5
f 6
g 7
output received:
a 1
b 2
c 3
0 4
0 5
0 6
awk 'BEGIN{i=1}{ if($2 == i){print $0; } else{print "0",i++; print $0 } i++ }' check
increment i one more time in the else (you are inserting a new line)
print the current line in the else, too
This works only if there is only one line missing between the present lines; otherwise you need a loop printing the missing lines.
Or simplified:
awk 'BEGIN{i=1}{ if($2 != i){print "0",i++; } print $0; i++ }' check
Yours is broken because:
1. you read the next line ("e 5"),
2. $2 is not equal to your counter,
3. you print the placeholder line and increment your counter (to 5),
4. you do not print the current line,
5. you read the next line ("f 6"),
6. goto 2
A while loop is warranted here -- that will also handle the case when you have gaps greater than a single number.
awk '
NR == 1 { prev = $2 }
{
    while ($2 > prev+1)
        print "0", ++prev
    print
    prev = $2
}
' check
or, if you like impenetrable one-liners:
awk 'NR==1{p=$2}{while($2>p+1)print "0",++p;p=$2}1' check
All you need is:
awk '{while (++i<$2) print 0, i}1' file
Look:
$ cat file
a 1
b 2
c 3
e 5
f 6
g 7
k 11
n 14
$ awk '{while (++i<$2) print 0, i}1' file
a 1
b 2
c 3
0 4
e 5
f 6
g 7
0 8
0 9
0 10
k 11
0 12
0 13
n 14

Transforming multiple entries of data for the same ID into a row-awk

I have data in the following format:
ID Date X1 X2 X3
1 01/01/00 1 2 3
1 01/02/00 7 8 5
2 01/03/00 9 7 1
2 01/04/00 1 4 5
I would like to group measurements into new rows according to ID, so I end up with:
ID Date X1 X2 X3 Date X1_2 X2_2 X3_2
1 01/01/00 1 2 3 01/02/00 7 8 5
2 01/03/00 9 7 1 01/04/00 1 4 5
etc.
I have as many as 20 observations for a given ID.
So far I have tried the technique given by http://gadgetsytecnologia.com/da622c17d34e6f13e/awk-transpose-childids-column-into-row.html
The code I have tried so far is:
awk -F, OFS = '\t' 'NR >1 {a[$1] = a[$1]; a[$2] = a[$2]; a[$3] = a[$3];a[$4] = a[$4]; a[$5] = a[$5] OFS $5} END {print "ID,Date,X1,X2,X3,Date_2,X1_2, X2_2 X3_2'\t' for (ID in a) print a[$1:$5] }' file.txt
The file is a tab delimited file. I don't know how to manipulate the data, or to account for the fact that there will be more than two observations per person.
Just keep track of what was the previous first field. If it changes, print the stored line:
awk 'NR==1 {print; next} # print header
prev && $1!=prev {print prev, line; line=""} # print on different $1
{prev=$1; $1=""; line=line $0} # store data and remove $1
END {print prev, line}' file # print trailing line
If you have tab-separated fields, just add -F"\t".
Test
$ awk 'NR==1 {print; next} prev && $1!=prev {print prev, line; line=""} {prev=$1; $1=""; line=line $0} END {print prev, line}' a
ID Date X1 X2 X3
1 01/01/00 1 2 3 01/02/00 7 8 5
2 01/03/00 9 7 1 01/04/00 1 4 5
You can try this (gnu-awk solution):
gawk '
NR == 1 {
    N = NF;
    MAX = NF-1;
    for (i=1; i<=NF; i++) {             # store column names
        names[i] = $i;
    }
    next;
}
{
    for (i=2; i<=N; i++) {
        a[$1][length(a[$1])+1] = $i;    # store records for each id
    }
    if (length(a[$1]) > MAX) {
        MAX = length(a[$1]);
    }
}
END {
    firstline = names[1];
    for (i=1; i<=MAX; i++) {            # build the header line
        column = int((i-1)%(N-1))+2;
        count = int((i-1)/(N-1));
        firstline = firstline OFS names[column];
        if (count>0) {
            firstline = firstline "_" count;
        }
    }
    print firstline;
    for (id in a) {                     # print each record in store
        line = id;
        for (i=1; i<=length(a[id]); i++) {
            line = line OFS a[id][i];
        }
        print line;
    }
}
' input
input
ID Date X1 X2 X3
1 01/01/00 1 2 3
1 01/02/00 7 8 5
2 01/03/00 9 7 1
2 01/04/00 1 4 5
1 01/03/00 72 28 25
you get
ID Date X1 X2 X3 Date_1 X1_1 X2_1 X3_1 Date_2 X1_2 X2_2 X3_2
1 01/01/00 1 2 3 01/02/00 7 8 5 01/03/00 72 28 25
2 01/03/00 9 7 1 01/04/00 1 4 5