How to print something multiple times in awk

I have a file sample.txt that looks like this:
Sequence: chr18_gl000207_random
Repeat 1
Indices: 2822--2996 Score: 135
Period size: 36 Copynumber: 4.8 Consensus size: 36
Consensus pattern (36 bp):
TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
Repeat 2
Indices: 2736--3623 Score: 932
Period size: 111 Copynumber: 8.1 Consensus size: 111
Consensus pattern (111 bp):
TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
GTGGTGGCTGTTGTGGTTGTAGCCTCAGTGGAAGTGCCTGCAGTTG
Repeat 3
Indices: 3421--3496 Score: 89
Period size: 39 Copynumber: 1.9 Consensus size: 39
Consensus pattern (39 bp):
AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT
I have used awk to extract values for parameters that are relevant for me like this:
paste <(awk '/Indices/ {print $2}' sample.txt) <(awk '/Period size/ {print $3}' sample.txt) <(awk '/Copynumber/ {print $5}' sample.txt) <(awk '/Consensus pattern/ {getline; print $0}' sample.txt)
Output:
2822--2996 36 4.8 TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
2736--3623 111 8.1 TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
3421--3496 39 1.9 AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT
Now I want to add the parameter Sequence to every row.
Desired output:
chr18_gl000207_random:2822--2996 36 4.8 TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
chr18_gl000207_random:2736--3623 111 8.1 TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
chr18_gl000207_random:3421--3496 39 1.9 AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT
I want to do this for many files in a loop so I need a solution that would work with a different number of Repeats as well.

$ cat tst.awk
BEGIN { OFS="\t" }
$1 == "Sequence:" { seq = $2; next }
$1 == "Indices:" { ind = $2; next }
$1 == "Period" { per = $3; cpy = $5; next }
$1 == "Consensus" { isCon=1; next }
isCon { print seq":"ind, per, cpy, $1; isCon=0 }
$ awk -f tst.awk file
chr18_gl000207_random:2822--2996 36 4.8 TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
chr18_gl000207_random:2736--3623 111 8.1 TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
chr18_gl000207_random:3421--3496 39 1.9 AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT
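The question also asks about running this over many files. Because the script stores a new seq every time it meets a Sequence: line, a single awk call can read all the files in turn, so no explicit loop is needed. A minimal sketch under that assumption; the function name extract_repeats and the two small input files are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: the same rules as tst.awk, applied to any number of files.
extract_repeats() {
    awk '
        BEGIN { OFS = "\t" }
        $1 == "Sequence:" { seq = $2; next }            # remember the current sequence name
        $1 == "Indices:"  { ind = $2; next }
        $1 == "Period"    { per = $3; cpy = $5; next }
        $1 == "Consensus" { isCon = 1; next }           # the pattern starts on the next line
        isCon             { print seq ":" ind, per, cpy, $1; isCon = 0 }
    ' "$@"
}

# Two made-up inputs with a different number of repeats each.
tmp=$(mktemp -d)
cat > "$tmp/a.txt" <<'EOF'
Sequence: seqA
Repeat 1
Indices: 10--20 Score: 5
Period size: 3 Copynumber: 2.0 Consensus size: 3
Consensus pattern (3 bp):
ACG
EOF
cat > "$tmp/b.txt" <<'EOF'
Sequence: seqB
Repeat 1
Indices: 1--8 Score: 7
Period size: 4 Copynumber: 2.0 Consensus size: 4
Consensus pattern (4 bp):
ACGT
Repeat 2
Indices: 30--41 Score: 9
Period size: 2 Copynumber: 6.0 Consensus size: 2
Consensus pattern (2 bp):
AT
EOF

extract_repeats "$tmp"/*.txt
rm -rf "$tmp"
```

As in tst.awk, only the first line of a consensus pattern is printed, which matches the desired output in the question.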

Related

Making AWK code more efficient when evaluating sets of records

I have a file with 5 fields of content. I am evaluating 4 lines at a time in the file. So, records 1-4 are evaluated as a set. Records 5-8 are another set. Within each set, I want to extract the time from field 5 when field 4 has the max value. If there are duplicate values in field 4, then evaluate the maximum value in field 2 and use the time in field 5 associated with the max value in field 2.
For example, in the first 4 records, there is a duplicate max value in field 4 (value of 53). When that happens, I need to look at field 2, find its maximum value, and print the time in field 5 from that record.
The Data Set is:
00 31444 8.7 24 00:04:32
00 44574 12.4 25 00:01:41
00 74984 20.8 53 00:02:22
00 84465 23.5 53 00:12:33
01 34748 9.7 38 01:59:28
01 44471 12.4 37 01:55:29
01 74280 20.6 58 01:10:24
01 80673 22.4 53 01:55:49
The desired Output for records 1 through 4 is 00:12:33
The desired output for records 5 through 8 is 01:10:24
Here is my answer:
Evaluate Records 1 through 4
awk 'NR==1,NR==4 {if(max <= $4) {max = $4; time = $5} else if(max == $4) {max = $2; time = $5};next}END {print time}' test.txt test.txt
Output is: 00:12:33
Evaluate Records 5 through 8
awk 'NR==5,NR==8 {if(max <= $4) {max = $4; time = $5} else if(max == $4) {max = $2; time = $5};next}END {print time}' test.txt test.txt
Output is 01:10:24
Any suggestions on how to evaluate the record ranges more efficiently without having to write an awk statement for each set of records?
Thanks
Based on your sample input, the fact that there are 4 lines for each key (first field) seems to be irrelevant; what you really want is to produce output for each key. So consider sorting the input by key and then by your comparison fields (field 4, then field 2, both numerically descending), and printing the first field 5 value seen for each key. The n and r modifiers are attached to each key because a key that carries its own ordering modifier does not inherit a global -n:
$ sort -k1,1n -k4,4nr -k2,2nr file | awk '!seen[$1]++{print $5}'
00:12:33
01:10:24
This awk code
NR % 4 == 1 {max4 = $4; max2 = $2}
$4 > max4 || $4 == max4 && $2 >= max2 {max4 = $4; max2 = $2; val5 = $5}
NR % 4 == 0 {printf "lines %d-%d: %s\n", (NR - 3), NR, val5}
outputs
lines 1-4: 00:12:33
lines 5-8: 01:10:24
Looking at the data, you might want to group sets by $1 instead of hardcoding 4 lines:
awk '
function emit(nr) {printf "lines %d-%d: %s\n", nr - 3, nr, val5}
$1 != setId {
if (NR > 1) emit(NR - 1)
setId = $1
max4 = $4
max2 = $2
}
$4 > max4 || $4 == max4 && $2 >= max2 {max4 = $4; max2 = $2; val5 = $5}
END {emit(NR)}
' data
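The emit function above still assumes that every set is 4 lines long (nr - 3). A sketch that records where each set starts, so sets of any length are labelled correctly; the function name max_time_per_group and the extra "02" rows are made up for illustration. The key is compared as a string, since a first key of "00" would otherwise compare numerically equal to the still-unset setId:

```shell
#!/bin/sh
# Hypothetical helper: per-group maximum on field 4, ties broken by field 2.
max_time_per_group() {
    awk '
        function emit(last) { printf "lines %d-%d: %s\n", first, last, val5 }
        ($1 "") != setId {                 # string comparison: a new key starts a new set
            if (NR > 1) emit(NR - 1)
            setId = $1 ""
            first = NR                     # remember where this set begins
            max4 = $4
            max2 = $2
        }
        $4 > max4 || ($4 == max4 && $2 >= max2) { max4 = $4; max2 = $2; val5 = $5 }
        END { if (NR > 0) emit(NR) }
    ' "$@"
}

# The data from the question plus a made-up set with only two records.
max_time_per_group <<'EOF'
00 31444 8.7 24 00:04:32
00 44574 12.4 25 00:01:41
00 74984 20.8 53 00:02:22
00 84465 23.5 53 00:12:33
01 34748 9.7 38 01:59:28
01 44471 12.4 37 01:55:29
01 74280 20.6 58 01:10:24
01 80673 22.4 53 01:55:49
02 100 1.0 40 02:00:01
02 200 2.0 40 02:00:02
EOF
```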
An awk-based solution that uses a synthetic ASCII string-comparison key combining $4 and $5, while avoiding any % (modulo) operations:
mawk '
BEGIN { CONVFMT = "%020.f" (__=___=____=_____="")
_+=_+=++_ } { ____= __!=(__=__==$((_____=(+$_ "")"(" $NF)^!_) \
? __ : $!!_) || ____<_____ ? _____ : ____
} _==++___ {
printf(" group %-*s [%*.f, %-*.f] :: %s\n", --_*--_, "\"" (__) "\"", _+_,
NR-++_, ++_, NR, substr(____, index(____, "(")+_^(_____=____=___=""))) }'
group "00" [ 1, 4 ] :: 00:12:33
group "01" [ 5, 8 ] :: 01:10:24

Concatenate multiple files and create new file based on the value

I have more than 50 files like this:
dell.txt
Name Id Year Value
xx.01 45 1990 2k
SS.01 89 2000 6.0k
Hp.txt
Name Id Year Value
xx.01 48 1994 21k
SS.01 80 2001 2k
Apple.txt
Name Id Year Value
xx.02 45 1990 20k
SS.01 89 2000 60k
kp.03 23 1996 530k
I just need to make a new file like this:
Name Id Year dell Hp Apple
xx.01 45 1990 2k 0 0
xx.01 48 1994 0 21k 0
xx.02 45 1990 0 0 20k
SS.01 80 2001 0 2k 0
SS.01 89 2000 6.0k 0 60k
kp.03 23 1996 0 0 530k
I tried paste for the concatenation, but it puts the rows in a different order. Is there another way, using awk? I used the following code:
$ awk ' FNR==1{ if (!($0 in h)){file=h[$0]=i++} else{file=h[$0];next} } {print >> (file)} ' *.txt
Could you please try the following? It is written and tested with GNU awk (ARGIND is a gawk extension) and gives the output in sorted order.
awk '
FNR==1{
tfile=FILENAME
sub(/\..*/,"",tfile)
file=(file?file OFS:"")tfile
header=($1 FS $2 FS $3)
next
}
{
a[$1 FS $2 FS $3 "#" FILENAME]=$NF
keys[$1 FS $2 FS $3]
}
END{
print header,file
for(key in keys){
printf("%s ",key)
for(j=1;j<=ARGIND;j++){
idx=key "#" ARGV[j]
val=(val?val OFS:"")(idx in a?a[idx]:0)
}
printf("%s\n",val)
val=""
}
}
' dell.txt Hp.txt Apple.txt | sort -k1 | column -t
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
FNR==1{ ##Checking if this is 1st line of an Input_file.
tfile=FILENAME
sub(/\..*/,"",tfile) ##Removing the extension from the Input_file name.
file=(file?file OFS:"")tfile ##Creating file which has all Input_file names in it.
header=($1 FS $2 FS $3) ##Header has 3 fields in it from 1st line.
next ##next will skip all further statements from here.
}
{
a[$1 FS $2 FS $3 "#" FILENAME]=$NF ##Creating a with index of 1st, 2nd, 3rd fields # Input_file name and has value as last field.
keys[$1 FS $2 FS $3] ##Remembering each distinct combination of 1st, 2nd, 3rd fields once.
}
END{ ##Starting END block of this awk program from here.
print header,file ##Printing header and file variables here.
for(key in keys){ ##Traversing through keys here, one output row per key.
printf("%s ",key) ##Printing key here.
for(j=1;j<=ARGIND;j++){ ##Running a for loop over all Input_files, from j=1 till ARGIND value.
idx=key "#" ARGV[j] ##Creating idx, the index of a for this key and this Input_file.
val=(val?val OFS:"")(idx in a?a[idx]:0) ##Appending a[idx] to val if idx is present in a, or else 0.
}
printf("%s\n",val) ##Printing val here with new line.
val="" ##Nullifying val here.
}
}
' dell.txt Hp.txt Apple.txt | sort -k1 | column -t ##Mentioning Input_file names, sorting output and then using column -t to align the output.
Output will be as follows.
Name Id Year dell Hp Apple
SS.01 80 2001 0 2k 0
SS.01 89 2000 6.0k 0 60k
kp.03 23 1996 0 0 530k
xx.01 45 1990 2k 0 0
xx.01 48 1994 0 21k 0
xx.02 45 1990 0 0 20k
Here is an awk script to join the files as required.
BEGIN { OFS = "\t"}
NR==1 { col[++c] = $1 OFS $2 OFS $3 }
FNR==1 {
split(FILENAME, arr, ".")
f = arr[1]
col[++c] = f
next
}
{
id[$1 OFS $2 OFS $3] = $4
cell[$1 OFS $2 OFS $3 OFS f] = $4
}
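END {
# Pass the data as printf arguments rather than as the format string,
# so a "%" in a name or value is printed literally.
for (i=1; i<=length(col); i++) {
printf "%s%s", col[i], OFS
}
printf "%s", ORS
for (i in id) {
printf "%s%s", i, OFS
for (c=2; c<=length(col); c++) {
printf "%s%s", (cell[i OFS col[c]] ? cell[i OFS col[c]] : "0"), OFS
}
printf "%s", ORS
}
}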
Usage:
awk -f tst.awk *.txt | sort -nk3
Note that the glob expands to the files in alphabetical order, and the order of the arguments determines the column order of the output. If you want a different column order, you have to order the arguments yourself, as in the test below.
The output is a real tab-separated file; if you want space-aligned columns instead, pipe it to column -t.
Testing
Using your sample files and providing their order:
> awk -f tst.awk dell.txt Hp.txt Apple.txt | sort -nk3 | column -t
Name Id Year dell Hp Apple
xx.01 45 1990 2k 0 0
xx.02 45 1990 0 0 20k
xx.01 48 1994 0 21k 0
kp.03 23 1996 0 0 530k
SS.01 89 2000 6.0k 0 60k
SS.01 80 2001 0 2k 0
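Both answers above rely on sort to order the rows. A sketch of an alternative that keeps the rows in the order they are first seen and numbers the files itself, so it needs neither ARGIND nor length(array). The function name pivot_files is made up for illustration, and it assumes every file has the same four-column layout with a header line:

```shell
#!/bin/sh
# Hypothetical helper: one output column per input file, rows in first-seen order.
pivot_files() {
    awk '
        FNR == 1 {                         # header line of each file
            name = FILENAME
            sub(/^.*\//, "", name)         # strip any directory part
            sub(/\.[^.]*$/, "", name)      # strip the extension
            files[++nf] = name
            if (nf == 1) header = $1 OFS $2 OFS $3
            next
        }
        {
            key = $1 OFS $2 OFS $3
            if (!(key in seen)) { seen[key]; order[++nk] = key }
            cell[key, nf] = $4             # value of this key in file number nf
        }
        END {
            line = header
            for (j = 1; j <= nf; j++) line = line OFS files[j]
            print line
            for (i = 1; i <= nk; i++) {
                line = order[i]
                for (j = 1; j <= nf; j++)
                    line = line OFS (((order[i], j) in cell) ? cell[order[i], j] : 0)
                print line
            }
        }
    ' "$@"
}

# The three sample files from the question, recreated in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'Name Id Year Value' 'xx.01 45 1990 2k' 'SS.01 89 2000 6.0k' > "$tmp/dell.txt"
printf '%s\n' 'Name Id Year Value' 'xx.01 48 1994 21k' 'SS.01 80 2001 2k' > "$tmp/Hp.txt"
printf '%s\n' 'Name Id Year Value' 'xx.02 45 1990 20k' 'SS.01 89 2000 60k' 'kp.03 23 1996 530k' > "$tmp/Apple.txt"

pivot_files "$tmp/dell.txt" "$tmp/Hp.txt" "$tmp/Apple.txt"
rm -rf "$tmp"
```

The column order again follows the argument order, so the files are listed explicitly rather than through a glob.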

AWK prints empty line of NA's at end of file

I have an older script that has been bugging me for a while now, which has a small bug in it that I haven't really gotten around to fixing, but I think it's about time. The script basically appends the columns of different files based on the ID of the rows. For example...
test1.txt:
a 3
b 2
test2.txt:
a 5
b 9
... should yield a result of:
a 3 5
b 2 9
The script itself looks like this:
#!/bin/bash
gawk 'BEGIN { OFS="\t" }
{ vals[$1,ARGIND]=$2; keys[$1] }
END {
for (key in keys) {
printf "%s%s", key, OFS
for (colNr=1; colNr<=ARGIND; colNr++) {
printf "%s%s", vals[key,colNr], (colNr<ARGIND?OFS:ORS)
}
} printf "\n"
}' $1 $2
... called as $ script.sh test1.txt test2.txt. The problem is that the result I get is not exactly what I should get:
a 3 5
b 2 9
NA NA NA
... i.e. I get a row with NA's at the very end of the file. So far I've just been deleting this row manually, but it'd be nice to not have to do that. I don't really see where this weird functionality is coming from, though... Anybody got any ideas? I'm using GAWK on OSX, if that matters.
Here's some actual input (that's what I get for trying to make the question simple and to the point! =P)
target_id length eff_length est_counts tpm
ENST00000574176 596 282 6 0.825408
ENST00000575242 103 718 105 5.19804
ENST00000573052 291 291 21 2.61356
ENST00000312051 223 192 2559 46.8843
I'm interested in the target_id and tpm columns, the others are unimportant. My full script:
FILES=$(find . -name 'data.txt' | xargs)
# Get replicate names for column header
printf "%s" 'ENSTID'
for file in $FILES; do
file2="${file/.results\/data.txt/}"
file3="${file2/.\/*\//}"
printf "\t%s" $file3
done
printf "\n"
gawk 'BEGIN { OFS="\t" }
{ vals[$1,ARGIND]=$5; keys[$1] }
END {
for (key in keys) {
printf "%s%s", key, OFS
for (colNr=1; colNr<=ARGIND; colNr++) {
printf "%s%s", vals[key,colNr], (colNr<ARGIND?OFS:ORS)
}
} printf "\n"
}' $FILES
(i.e. all the files are named data.txt, but are located in differently named subfolders.)
A simpler idiomatic way to do it would be
$ cat test1.txt
a 3
b 2
$ cat test2.txt
a 5
b 9
$ awk -v OFS="\t" 'NR==FNR{rec[$1]=$0;next}$1 in rec{print rec[$1],$2}' test1.txt test2.txt
a 3 5
b 2 9
For the actual input
$ cat test1.txt
target_id length eff_length est_counts tpm
ENST00000574176 596 282 6 0.825408
ENST00000575242 103 718 105 5.19804
ENST00000573052 291 291 21 2.61356
ENST00000312051 223 192 2559 46.8843
$ cat test2.txt
target_id length eff_length est_counts tpm
ENST00000574176 996 122 6 0.3634
ENST00000575242 213 618 105 7.277
ENST00000573052 329 291 89 2.0356
ENST00000312051 21 00 45 0.123
$ awk 'NR==FNR{rec1[$1]=$1;rec2[$1]=$5;next}$1 in rec1{printf "%-20s %-15s %-15s\n", rec1[$1],rec2[$1],$5}' test1.txt test2.txt
target_id tpm tpm
ENST00000574176 0.825408 0.3634
ENST00000575242 5.19804 7.277
ENST00000573052 2.61356 2.0356
ENST00000312051 46.8843 0.123
Notes :
-v OFS="\t" produces tab-separated fields in the output. The order in which the files are passed matters (for the first solution).
Hard-coding the newline as in
printf "%-20s %-15s %-15s\n", rec1[$1],rec2[$1],$5
ties the line ending to "\n" instead of the output record separator. If you prefer to honor ORS, you can replace it with
printf "%-20s %-15s %-15s", rec1[$1],rec2[$1],$5;print "" # print "" emits only ORS; a bare print would also output $0
Edit : for more than two files
$ shopt -s globstar
$ awk 'NR==FNR{rec1[$1]=$1" "$5;next}{if($1 in rec1){rec1[$1]=rec1[$1]" "$5}else{rec1[$1]=$1" "$5}}END{for(i in rec1){if(i != "target_id"){print rec1[i];}}}' **/test*.txt
ENST00000312051 46.8843 46.8843 0.123 46.8843 0.123
ENST00000573052 2.61356 2.61356 2.0356 2.61356 2.0356
ENST00000575242 5.19804 5.19804 7.277 5.19804 7.277
ENST00000574176 0.825408 0.825408 0.3634 0.825408 0.3634
ENST77777777777 01245
ENST66666666666 7.277 7.277
$ shopt -u globstar
As far as I can see, the only reason you would get an empty line at the end of the output (which is what I get with gawk on OS X) is that you have a printf "\n" at the end of the script, which will add a newline even though you've just printed ORS.
Since your bash script is essentially an awk script, I would make a proper awk script out of it. That would additionally save you the problem of having incorrect quoting of $1 and $2 in the shell script (would break on exotic filenames). This also gives you proper syntax highlighting in your favourite text editor, if it understands Awk:
#!/usr/bin/gawk -f
BEGIN { OFS = "\t" }
{
vals[$1,ARGIND] = $2;
keys[$1] = 1;
}
END {
for (key in keys) {
printf("%s%s", key, OFS);
for (colNr = 1; colNr <= ARGIND; colNr++) {
printf("%s%s", vals[key,colNr], (colNr < ARGIND ? OFS : ORS));
}
}
}
The same can be done with more complex sed editing scripts.
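To see the effect of dropping the trailing printf "\n", here is a minimal sketch of the same join written so that each row is terminated exactly once. It counts files with FNR == 1 instead of using gawk's ARGIND, which is an assumption made so that it runs in any POSIX awk and requires non-empty input files. The function name join_columns is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: append column 2 of every file, keyed on column 1.
join_columns() {
    awk '
        BEGIN { OFS = "\t" }
        FNR == 1 { nfiles++ }                        # a new input file starts here
        { vals[$1, nfiles] = $2; keys[$1] }
        END {
            for (key in keys) {
                printf "%s", key
                for (col = 1; col <= nfiles; col++)
                    printf "%s%s", OFS, vals[key, col]
                printf "%s", ORS                     # one record terminator per row, nothing after the loop
            }
        }
    ' "$@"
}

tmp=$(mktemp -d)
printf 'a 3\nb 2\n' > "$tmp/test1.txt"
printf 'a 5\nb 9\n' > "$tmp/test2.txt"
join_columns "$tmp/test1.txt" "$tmp/test2.txt" | sort   # for (key in keys) has no defined order
rm -rf "$tmp"
```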

awk to divide fields based on match in file1

I am trying to use awk to do the below steps
find matching fields $1 strings between file1 and file2
if the $1 strings match then $2 in file1 is divided by $3 in file2 (that is x, which is rounded up to 3 significant figures)
x is multiplied by 100
each x is subtracted from 100 and that is the %
file1
USH2A 21
GIT1 357
PALB2 3
file2
GIT1 21 3096
USH2A 71 17718
PALB2 13 3954
awk
awk 'NR==FNR{a[$1]=$1;next;}{if ($1 in a) print $1, $2/a[$3];else print;}' file2 file1 > test
awk: cmd. line:1: (FILENAME=search FNR=2) fatal: division by zero attempted
awk 'NR==FNR{a[$1]=$1;next;}{if ($1 in a) print $1, $2/a[$3];else print;}' file1 file2 > test
awk: cmd. line:1: (FILENAME=search FNR=1) fatal: division by zero attempted
example
USH2A match is found so (21/17718)*100 = 0.11 and 100-0.11 = 99.89%
GIT1 match is found so (357/3096)*100 = 11.53 and 100-11.53 = 88.47%
PALB2 match is found so (3/3954)*100 = 0.07 and 100-0.07 = 99.93%
I am going line by line in the code and can see that I am already getting errors. Thank you :).
awk to the rescue!
$ awk 'function ceil(v) {return int(v)==v?v:int(v+1)}
NR==FNR{f1[$1]=$2; next}
$1 in f1{print $1, ceil(10000*(1-f1[$1]/$3))/100 "%"}' file1 file2
GIT1 88.47%
USH2A 99.89%
PALB2 99.93%
Note that awk has no built-in round-up (ceiling) function, so a ceil function is defined for this task.
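The "division by zero" in the question comes from a[$3]: the array is keyed by the name in $1, so a[$3] refers to an element that was never set and evaluates to 0. A sketch of the same calculation as above with an explicit guard for a zero divisor and a ceil that also behaves for negative values; the function name pct_remaining and the ZERO rows are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: usage is  pct_remaining file1 file2
pct_remaining() {
    awk '
        # int() truncates toward zero, so only positive fractions need the +1
        function ceil(v) { return (v > int(v)) ? int(v) + 1 : int(v) }
        NR == FNR { f1[$1] = $2; next }              # file1: name -> numerator
        ($1 in f1) && $3 != 0 {                      # skip rows that would divide by zero
            printf "%s %.2f%%\n", $1, ceil(10000 * (1 - f1[$1] / $3)) / 100
        }
    ' "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' 'USH2A 21' 'GIT1 357' 'PALB2 3' 'ZERO 5' > "$tmp/file1"
printf '%s\n' 'GIT1 21 3096' 'USH2A 71 17718' 'PALB2 13 3954' 'ZERO 1 0' > "$tmp/file2"
pct_remaining "$tmp/file1" "$tmp/file2"
rm -rf "$tmp"
```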
$ cat tst.awk
NR==FNR { a[$1]=$3; next }
$1 in a {
x = (a[$1] ? ($2*100)/a[$1] : 0)
printf "%s match is found so (%d/%d) *100 = %.2f and 100-%.2f = %.2f%%\n", $1, $2, a[$1], x, x, 100-x
}
$ awk -f tst.awk file2 file1
USH2A match is found so (21/17718) *100 = 0.12 and 100-0.12 = 99.88%
GIT1 match is found so (357/3096) *100 = 11.53 and 100-11.53 = 88.47%
PALB2 match is found so (3/3954) *100 = 0.08 and 100-0.08 = 99.92%

awk setting variables to make a range

I have the following two files:
File 1:
1 290 rs1345
2 450 rs5313
1 1120 rs4523
2 790 rs4325
File 2:
1 201 LDLR
2 714 APOA5
1 818 NOTCH5
1 514 TTN
I wish to isolate only the rows in file 2 in which the second field is within 100 units of the second field in file 1 (if field 1 matches):
Desired output: (note the third field is from the matching line in file1).
1 201 LDLR rs1345
2 714 APOA5 rs4325
I tried using the following code:
for i in {1..4} #there are 4 lines in file2
do
chr=$(awk 'NR=="'${i}'" { print $1 }' file2)
pos=$(awk 'NR=="'${i}'" { print $2 }' file2)
gene=$(awk 'NR=="'${i}'" { print $3 }' file2)
start=$(echo $pos | awk '{print $1-100}') #start and end variables for 100 unit range
end=$(echo $pos | awk '{print $1+100}')
awk '{if ($1=="'$chr'" && $2 > "'$start'" && $2 < "'$end'") print "'$chr'","'$pos'","'$gene'"$3}' file1
done
The code is not working, I believe something is wrong with my start and end variables, because when I echo $start, I get 414, which doesn't make sense to me and I get 614 when i echo $end.
I understand this question might be difficult to understand so please ask me if any clarification is necessary.
Thank you.
The difficulty is that $1 is not a unique key, so some care needs to be taken with the data structure to store the data in file 1.
With GNU awk, you can use arrays of arrays:
gawk '
NR==FNR {f1[$1][$2] = $3; next}
$1 in f1 {
for (val in f1[$1])
if (val-100 <= $2 && $2 <= val+100)
print $0, f1[$1][val]
}
' file1 file2
Otherwise, you have to use a one-dimensional array and stuff 2 pieces of information into the key:
awk '
NR==FNR {f1[$1,$2] = $3; next}
{
for (key in f1) {
split(key, a, SUBSEP)
if (a[1] == $1 && a[2]-100 <= $2 && $2 <= a[2]+100)
print $0, f1[key]
}
}
' file1 file2
That works with mawk and nawk (and gawk)
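If the window size needs to change, it can be passed in with -v instead of being hard-coded. A sketch based on the one-dimensional version above; the function name within_window and its argument order are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: usage is  within_window WINDOW file1 file2
within_window() {
    awk -v w="$1" '
        NR == FNR { f1[$1, $2] = $3; next }          # file1: (chr, pos) -> id
        {
            for (key in f1) {
                split(key, a, SUBSEP)
                if (a[1] != $1) continue             # field 1 must match
                d = $2 - a[2]
                if (d < 0) d = -d                    # absolute distance between positions
                if (d <= w + 0) print $0, f1[key]
            }
        }
    ' "$2" "$3"
}

tmp=$(mktemp -d)
printf '%s\n' '1 290 rs1345' '2 450 rs5313' '1 1120 rs4523' '2 790 rs4325' > "$tmp/file1"
printf '%s\n' '1 201 LDLR' '2 714 APOA5' '1 818 NOTCH5' '1 514 TTN' > "$tmp/file2"
within_window 100 "$tmp/file1" "$tmp/file2"
rm -rf "$tmp"
```

If a file2 row falls within the window of several file1 rows, the order of those matches is unspecified, because it follows the for (key in f1) traversal.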
#!/usr/bin/env python3
from io import StringIO
import pandas as pd
file1 = """
1 290 rs1345
2 450 rs5313
1 1120 rs4523
2 790 rs4325
"""
file2 = """
1 201 LDLR
2 714 APOA5
1 818 NOTCH5
1 514 TTN
"""
# read both space-separated tables (blank lines are skipped by default)
sio = StringIO(file1)
df1 = pd.read_table(sio, sep=" ", header=None)
df1.columns = ["a", "b", "c"]
sio = StringIO(file2)
df2 = pd.read_table(sio, sep=" ", header=None)
df2.columns = ["a", "b", "c"]
# pair every file2 row with every file1 row that has the same first field
df = pd.merge(df2, df1, left_on="a", right_on="a", how="outer")
#query is intuitive
r = df.query("b_y-100 < b_x <b_y + 100")
print(r[["a", "b_x", "c_x", "c_y"]])
output:
a b_x c_x c_y
0 1 201 LDLR rs1345
7 2 714 APOA5 rs4325
pandas is well suited to this kind of tabular data manipulation.