Concatenate multiple files and create new file based on the value - awk

I have more than 50 files like this:
dell.txt
Name Id Year Value
xx.01 45 1990 2k
SS.01 89 2000 6.0k
Hp.txt
Name Id Year Value
xx.01 48 1994 21k
SS.01 80 2001 2k
Apple.txt
Name Id Year Value
xx.02 45 1990 20k
SS.01 89 2000 60k
kp.03 23 1996 530k
I need to make a new file like this:
Name Id Year dell Hp Apple
xx.01 45 1990 2k 0 0
xx.01 48 1994 0 21k 0
xx.02 45 1990 0 0 20k
SS.01 80 2001 0 2k 0
SS.01 89 2000 6.0k 0 60k
kp.03 23 1996 0 0 530k
I tried paste for concatenation, but it puts the columns in a different order. Is there another way using awk? I used the following code:
$ awk ' FNR==1{ if (!($0 in h)){file=h[$0]=i++} else{file=h[$0];next} } {print >> (file)} ' *.txt

Could you please try the following, written and tested with GNU awk; it gives the output in sorted format.
awk '
FNR==1{
tfile=FILENAME
sub(/\..*/,"",tfile)
file=(file?file OFS:"")tfile
header=($1 FS $2 FS $3)
next
}
{
a[$1 FS $2 FS $3 "#" FILENAME]=$NF
}
END{
print header,file
for(i in a){
split(i,arr,"#")
sub(/#.*/,"",i)
if(seen[i]++){ continue }
printf("%s ",i)
for(i=1;i<=ARGIND;i++){
val=(val?val OFS:"")((arr[1] "#" ARGV[i]) in a?a[arr[1] "#" ARGV[i]]:0)
}
printf("%s\n",val)
val=""
}
}
' dell.txt Hp.txt Apple.txt | sort -k1 | column -t
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
FNR==1{ ##Checking if this is the 1st line of the current Input_file.
tfile=FILENAME ##Copying the current Input_file name into tfile.
sub(/\..*/,"",tfile) ##Removing the extension (from the first dot onwards) from tfile.
file=(file?file OFS:"")tfile ##Creating file which has all Input_file names in it.
header=($1 FS $2 FS $3) ##header keeps the first 3 fields from the 1st line.
next ##next will skip all further statements from here.
}
{
a[$1 FS $2 FS $3 "#" FILENAME]=$NF ##Creating array a whose index is the 1st, 2nd and 3rd fields plus "#" plus the Input_file name, and whose value is the last field.
}
END{ ##Starting END block of this awk program from here.
print header,file ##Printing header and file variables here.
for(i in a){ ##Traversing through a here.
split(i,arr,"#") ##Splitting i into arr with # as the delimiter.
sub(/#.*/,"",i) ##Substituting from # till the end of the index with NULL, leaving only the key.
if(seen[i]++){ continue } ##Skipping keys that have already been printed once.
printf("%s ",i) ##Printing the key here.
for(i=1;i<=ARGIND;i++){ ##Running a for loop from i=1 till ARGIND (the number of Input_files).
val=(val?val OFS:"")((arr[1] "#" ARGV[i]) in a?a[arr[1] "#" ARGV[i]]:0) ##Appending to val the value stored for this key in file ARGV[i], or 0 if that file has no entry for it.
}
printf("%s\n",val) ##Printing val here with new line.
val="" ##Nullifying val here.
}
}
' dell.txt Hp.txt Apple.txt | sort -k1 | column -t ##Mentioning Input_file names, sorting output and then using column -t to look output well.
Output will be as follows.
Name Id Year dell Hp Apple
SS.01 80 2001 0 2k 0
SS.01 89 2000 6.0k 0 60k
kp.03 23 1996 0 0 530k
xx.01 45 1990 2k 0 0
xx.01 48 1994 0 21k 0
xx.02 45 1990 0 0 20k

Here is an awk script to join the files as required.
BEGIN { OFS = "\t"}
NR==1 { col[++c] = $1 OFS $2 OFS $3 }    # key header taken from the very first line
FNR==1 {                                 # first line of each file
split(FILENAME, arr, ".")
f = arr[1]                               # file name without extension
col[++c] = f                             # becomes a value column
next
}
{
id[$1 OFS $2 OFS $3] = $4                # remember every key seen
cell[$1 OFS $2 OFS $3 OFS f] = $4        # value for this key in the current file
}
END {
for (i=1; i<=length(col); i++) {
printf col[i] OFS                        # print the header row
}
printf ORS
for (i in id) {
printf i OFS                             # key columns
for (c=2; c<=length(col); c++) {
printf (cell[i OFS col[c]] ? cell[i OFS col[c]] : "0") OFS   # value for this file, or 0
}
printf ORS
}
}
Usage:
awk -f tst.awk *.txt | sort -nk3
Note that the glob fetches the files in alphabetical order, and the argument order determines the column order of the output. If you want a different column order, you have to order the arguments yourself, for example like below.
The output is a real tab-separated file; if you want an aligned, table-like look with spaces, pipe it to column -t.
Testing
Using your sample files and giving them in the desired order:
> awk -f tst.awk dell.txt Hp.txt Apple.txt | sort -nk3 | column -t
Name Id Year dell Hp Apple
xx.01 45 1990 2k 0 0
xx.02 45 1990 0 0 20k
xx.01 48 1994 0 21k 0
kp.03 23 1996 0 0 530k
SS.01 89 2000 6.0k 0 60k
SS.01 80 2001 0 2k 0

Related

Making AWK code more efficient when evaluating sets of records

I have a file with 5 fields of content. I am evaluating 4 lines at a time in the file. So, records 1-4 are evaluated as a set. Records 5-8 are another set. Within each set, I want to extract the time from field 5 when field 4 has the max value. If there are duplicate values in field 4, then evaluate the maximum value in field 2 and use the time in field 5 associated with the max value in field 2.
For example, in the first 4 records there is a duplicate max value in field 4 (a value of 53). In that case, I need to look at field 2, find the maximum value there, and print the field-5 time associated with that row.
The Data Set is:
00 31444 8.7 24 00:04:32
00 44574 12.4 25 00:01:41
00 74984 20.8 53 00:02:22
00 84465 23.5 53 00:12:33
01 34748 9.7 38 01:59:28
01 44471 12.4 37 01:55:29
01 74280 20.6 58 01:10:24
01 80673 22.4 53 01:55:49
The desired Output for records 1 through 4 is 00:12:33
The desired output for records 5 through 8 is 01:10:24
Here is my answer:
Evaluate Records 1 through 4
awk 'NR==1,NR==4 {if(max <= $4) {max = $4; time = $5} else if(max == $4) {max = $2; time = $5};next}END {print time}' test.txt test.txt
Output is: 00:12:33
Evaluate Records 5 through 8
awk 'NR==5,NR==8 {if(max <= $4) {max = $4; time = $5} else if(max == $4) {max = $2; time = $5};next}END {print time}' test.txt test.txt
Output is 01:10:24
Any suggestions on how to evaluate the record ranges more efficiently without having to write an awk statement for each set of records?
Thanks
Based on your sample input, the fact that there are 4 lines for each key (first field) seems to be irrelevant; what you really want is to produce one output per key. So consider sorting the input by your desired comparison fields (field 4, then field 2) and then printing the first field-5 value seen for each key (field 1):
$ sort -n -k1,1 -k4,4r -k2,2r file | awk '!seen[$1]++{print $5}'
00:12:33
01:10:24
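If the !seen[$1]++ idiom is unfamiliar: it is true only the first time a given $1 is seen, so only the first line per key is printed. A throwaway illustration (made-up data, not part of the answer):
$ printf 'a 1\na 2\nb 3\nb 4\n' | awk '!seen[$1]++'
a 1
b 3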
This awk code
NR % 4 == 1 {max4 = $4; max2 = $2}
$4 > max4 || $4 == max4 && $2 >= max2 {max4 = $4; max2 = $2; val5 = $5}
NR % 4 == 0 {printf "lines %d-%d: %s\n", (NR - 3), NR, val5}
outputs
lines 1-4: 00:12:33
lines 5-8: 01:10:24
Looking at the data, you might want to group sets by $1 instead of hardcoding 4 lines:
awk '
function emit(nr) {printf "lines %d-%d: %s\n", nr - 3, nr, val5}
$1 != setId {
if (NR > 1) emit(NR - 1)
setId = $1
max4 = $4
max2 = $2
}
$4 > max4 || $4 == max4 && $2 >= max2 {max4 = $4; max2 = $2; val5 = $5}
END {emit(NR)}
' data
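With the shown sample data, this grouped-by-$1 version prints the same two lines as the fixed 4-line version above:
lines 1-4: 00:12:33
lines 5-8: 01:10:24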
An awk-based solution that uses a synthetic ASCII string-comparison key combining $4 and $5, while avoiding any %-modulo operations:
mawk '
BEGIN { CONVFMT = "%020.f" (__=___=____=_____="")
_+=_+=++_ } { ____= __!=(__=__==$((_____=(+$_ "")"(" $NF)^!_) \
? __ : $!!_) || ____<_____ ? _____ : ____
} _==++___ {
printf(" group %-*s [%*.f, %-*.f] :: %s\n", --_*--_, "\"" (__) "\"", _+_,
NR-++_, ++_, NR, substr(____, index(____, "(")+_^(_____=____=___=""))) }'
group "00" [ 1, 4 ] :: 00:12:33
group "01" [ 5, 8 ] :: 01:10:24

Counting the number of unique values based on more than two columns in bash

I need to modify the below code to work on more than one column.
Counting the number of unique values based on two columns in bash
awk ' ##Starting awk program from here.
BEGIN{
FS=OFS="\t"
}
!found[$0]++{ ##Checking condition if the whole line (1st and 2nd columns) is NOT present in found array, then do following.
val[$1]++ ##Creating val with 1st column index and keep increasing its value here.
}
END{ ##Starting END block of this program from here.
for(i in val){ ##Traversing through array val here.
print i,val[i] ##Printing i and value of val with index i here.
}
}
' Input_file ##Mentioning Input_file name here.
Table to count how many times each value occurs (across all the DIS columns):
patient sex DISa DISb DISc DISd DISe DISf DISg DISh DISi
patient1 male 550.1 550.5 594.1 594.3 594.8 591 1019 960.1 550.1
patient2 female 041 208 250.2 276.14 426.32 550.1 550.5 558 041
patient3 female NA NA NA NA NA NA NA 041 NA
The output I need is:
550.1 3
550.5 2
594.1 1
594.3 1
594.8 1
591 1
1019 1
960.1 1
550.1 1
041 3
208 1
250.2 1
276.14 1
426.32 1
558 1
Consider this awk:
awk -v OFS='\t' 'NR > 1 {for (i=3; i<=NF; ++i) if ($i+0 == $i) ++fq[$i]} END {for (i in fq) print i, fq[i]}' file
276.14 1
960.1 1
594.3 1
426.32 1
208 1
041 3
594.8 1
550.1 3
591 1
1019 1
558 1
550.5 2
250.2 1
594.1 1
A more readable form:
awk -v OFS='\t' '
NR > 1 {
for (i=3; i<=NF; ++i)
if ($i+0 == $i)
++fq[$i]
}
END {
for (i in fq)
print i, fq[i]
}' file
$i+0 == $i is a check to make sure the column value is numeric.
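A quick throwaway check (not part of the answer) shows which kinds of values pass the test; note that a field like 041 compares numerically, so it is kept:
$ echo '550.1 NA 041 DISa' | awk '{for (i=1; i<=NF; i++) if ($i+0 == $i) print $i}'
550.1
041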
If the ordering must be preserved, then you need an additional array b[] to keep the order in which each number is first encountered, e.g.
awk '
BEGIN { OFS = "\t" }
FNR > 1 {
for (i=3;i<=NF;i++)
if ($i~/^[0-9]/) {
if (!($i in a))
b[++n] = $i;
a[$i]++
}
}
END {
for (i=1;i<=n;i++)
print b[i], a[b[i]]
}' file
Example Use/Output
$ awk '
> BEGIN { OFS = "\t" }
> FNR > 1 {
> for (i=3;i<=NF;i++)
> if ($i~/^[0-9]/) {
> if (!($i in a))
> b[++n] = $i;
> a[$i]++
> }
> }
> END {
> for (i=1;i<=n;i++)
> print b[i], a[b[i]]
> }' patients
550.1 3
550.5 2
594.1 1
594.3 1
594.8 1
591 1
1019 1
960.1 1
041 3
208 1
250.2 1
276.14 1
426.32 1
558 1
Let me know if you have further questions.
Taking the complete solutions from the above 2 answers (#anubhava and #David), with all respect, and just adding a little tweak (an integer-value check, as per the OP's shown samples); adding 2 solutions here. Written and tested with shown samples only.
1st solution: if order doesn't matter in the output, try:
awk -v OFS='\t' '
NR > 1 {
for (i=3; i<=NF; ++i)
if (int($i))
++fq[$i]
}
END {
for (i in fq)
print i, fq[i]
}' Input_file
2nd solution: if order matters, then based on David's answer try:
awk '
BEGIN { OFS = "\t" }
FNR > 1 {
for (i=3;i<=NF;i++)
if (int($i)) {
if (!($i in a))
b[++n] = $i;
a[$i]++
}
}
END {
for (i=1;i<=n;i++)
print b[i], a[b[i]]
}' Input_file
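A side note on the int($i) check (an observation, not part of the original answers): int() returns 0 for non-numeric strings, but also for numeric values between 0 and 1, which is why these tweaks are only claimed for the shown samples, where every code is 041 or larger:
$ awk 'BEGIN { print int("550.1"), int("041"), int("NA"), int("0.5") }'
550 41 0 0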
Using GNU awk for multi-char RS:
$ awk -v RS='[[:space:]]+' '$0+0 == $0' file | sort | uniq -c
3 041
1 1019
1 208
1 250.2
1 276.14
1 426.32
3 550.1
2 550.5
1 558
1 591
1 594.1
1 594.3
1 594.8
1 960.1
If the order of fields really matters just pipe the above to awk '{print $2, $1}'.
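For example:
$ awk -v RS='[[:space:]]+' '$0+0 == $0' file | sort | uniq -c | awk '{print $2, $1}'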

How to print something multiple times in awk

I have a file sample.txt that looks like this:
Sequence: chr18_gl000207_random
Repeat 1
Indices: 2822--2996 Score: 135
Period size: 36 Copynumber: 4.8 Consensus size: 36
Consensus pattern (36 bp):
TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
Repeat 2
Indices: 2736--3623 Score: 932
Period size: 111 Copynumber: 8.1 Consensus size: 111
Consensus pattern (111 bp):
TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
GTGGTGGCTGTTGTGGTTGTAGCCTCAGTGGAAGTGCCTGCAGTTG
Repeat 3
Indices: 3421--3496 Score: 89
Period size: 39 Copynumber: 1.9 Consensus size: 39
Consensus pattern (39 bp):
AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT
I have used awk to extract values for parameters that are relevant for me like this:
paste <(awk '/Indices/ {print $2}' sample.txt) <(awk '/Period size/ {print $3}' sample.txt) <(awk '/Copynumber/ {print $5}' sample.txt) <(awk '/Consensus pattern/ {getline; print $0}' sample.txt)
Output:
2822--2996 36 4.8 TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
2736--3623 111 8.1 TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
3421--3496 39 1.9 AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT
Now I want to add the parameter Sequence to every row.
Desired output:
chr18_gl000207_random:2822--2996 36 4.8 TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
chr18_gl000207_random:2736--3623 111 8.1 TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
chr18_gl000207_random:3421--3496 39 1.9 AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT
I want to do this for many files in a loop so I need a solution that would work with a different number of Repeats as well.
$ cat tst.awk
BEGIN { OFS="\t" }
$1 == "Sequence:" { seq = $2; next }
$1 == "Indices:" { ind = $2; next }
$1 == "Period" { per = $3; cpy = $5; next }
$1 == "Consensus" { isCon=1; next }
isCon { print seq":"ind, per, cpy, $1; isCon=0 }
$ awk -f tst.awk file
chr18_gl000207_random:2822--2996 36 4.8 TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
chr18_gl000207_random:2736--3623 111 8.1 TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
chr18_gl000207_random:3421--3496 39 1.9 AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT
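Since the question mentions running this over many files in a loop, a minimal shell sketch (the output file names here are only an assumption):
for f in *.txt; do
    awk -f tst.awk "$f" > "${f%.txt}.tsv"
done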

Prevent awk from adding non-integers?

I have a file that has these columns that I would like to add:
absolute_broad_major_cn
1
1
1
1
1.76
1.76
NA
1
and
absolute_broad_minor_cn
1
1
1
1
0.92
0.92
NA
1
I did awk '{ print $1+$2 }', which worked well but it put 0 where there was an NA. Is it possible to make awk leave those lines alone and just put NA again instead (so awk only adds numbers)?
Edit: Desired output is:
<Column header>
2
2
2
2
2.68
2.68
NA
2
paste absolute* | awk '{ if ($1 == "NA" && $2 == "NA") print "NA"; else print $1 + $2; }'
would do the trick; whether you want && (both must be "NA" to produce an "NA") or || (either one being "NA" produces an "NA") depends on your need.
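For instance, the || variant would be:
paste absolute* | awk '{ if ($1 == "NA" || $2 == "NA") print "NA"; else print $1 + $2; }'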
Could you please try the following, written and tested with the shown samples.
awk '
FNR==NR{
a[FNR]=$0
next
}
{
print ($0~/[a-zA-Z]/ && a[FNR]~/[a-zA-Z]/?"NA":a[FNR]+$0)
}
' absolute_broad_major_cn absolute_broad_minor_cn
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when Input_file absolute_broad_major_cn is being read.
a[FNR]=$0 ##Creating array a with index FNR and having value as current line here.
next ##next will skip all further statements from here.
}
{
print ($0~/[a-zA-Z]/ && a[FNR]~/[a-zA-Z]/?"NA":a[FNR]+$0) ##Printing either the sum of the current line and the stored array a value, or NA in case an alphabetic character is found either in the array value OR in the current line.
}
' absolute_broad_major_cn absolute_broad_minor_cn ##Mentioning Input_file names here.
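With the two sample columns shown in the question, this would print the following (note the header row also comes out as NA, since both header lines contain letters):
NA
2
2
2
2
2.68
2.68
NA
2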
I think what you're really trying to do is sum 2 numeric columns from 1 file:
awk '{print ($1==($1+0) ? $1+$2 : $1)}' file
$1 == $1+0 will only be true if $1 is a number.
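Presumably the two columns would be joined first, e.g. with paste (a sketch, assuming the file names shown above):
paste absolute_broad_major_cn absolute_broad_minor_cn | awk '{print ($1==($1+0) ? $1+$2 : $1)}'
With the sample data this prints the major column's header on the first line, then 2, 2, 2, 2, 2.68, 2.68, NA, 2.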
Just remove the lines with NA and then add them (note that this drops the NA rows entirely rather than printing NA for them):
awk '$1 != "NA"' FS=' ' file | awk '{ print $1+$2 }'

Separate fields based on first column content, match on second column and subtract fourth column values in awk

My input file is like:
a10 otu1 xx 44
b24 otu2 xxx 52
x35 otu3 xy 11
x45 otu3 zz 22
z452 Otu5 rr 78
control1 otu1 w 4
control2 otu2 ee 30
control3 otu3 tt 20
control4 otu4 yy 10
First, I want to separate the control rows from the others based on column 1, and then match the second-column values of the controls against the second column of the other rows. Where a match is found in the second column, I want to subtract the corresponding values in the fourth column.
Output file would be:
a10 otu1 xx 40
b24 otu2 xxx 22
x35 otu3 xy -9
x45 otu4 zz 12
z452 Otu5 rr 78
Now, to match the second column and subtract values in fourth column, I use:
awk 'NR==FNR {a[$2]=$2 in a?a[$2]-$4:$4; next} !b[$2]++ {print $1,$2,$3,a[$2]}' inputfile.txt{,}
How can I feed the separate field information (control vs. others) into the script?
Could you please try the following.
awk '
!/^control/{
a[++count1]=$NF
b[count1]=$1 OFS $2 OFS $3
next
}
{
c[++count2]=$NF
}
END{
for(i=1;i<=count1;i++){
print b[i],a[i]-c[i]
}
}
' Input_file
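For the sample input, this pairs the i-th non-control row with the i-th control row (note the pairing is by position, not by the otu value in column 2), producing:
a10 otu1 xx 40
b24 otu2 xxx 22
x35 otu3 xy -9
x45 otu3 zz 12
z452 Otu5 rr 78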
More generic solution: in case you don't want to hardcode the field numbers kept for each row, and the file has more than 4 fields, then try the following.
awk '
!/^control/{
a[++count1]=$NF
$NF=""
sub(/ +$/,"")
b[count1]=$0
next
}
{
c[++count2]=$NF
}
END{
for(i=1;i<=count1;i++){
print b[i],a[i]-c[i]
}
}
' Input_file
$ cat tst.awk
NR==FNR {
if ( /^control/ ) {
control[$2] = $NF
}
next
}
!/^control/ {
$NF = $NF - control[$2]
print
}
$ awk -f tst.awk file file
a10 otu1 xx 40
b24 otu2 xxx 22
x35 otu3 xy -9
x45 otu3 zz 2
z452 Otu5 rr 78
Here's another take on this:
/^control/ {
a[$2]=a[$2]-$4
next
}
{
a[$2]=a[$2]+$4
b[$2]=$1 OFS $2 OFS $3
}
END {
for(i in b) print b[i] OFS a[i]
}
This subtracts any values on control lines, adds any values on other lines, storing them in the array a[]. It maintains an array of line content, b[].
By storing content in the array, it's possible for there to be multiple data or control lines affecting the value, and they can appear in any order in your input (since 44 - 4 is the same as -4 + 44).
Note that because our END for loop steps through the array, output is not guaranteed to be in the same order as input.
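A hypothetical run (assuming the script is saved as combine.awk; array traversal order is implementation-dependent, so the rows may come out in any order, and the two otu3 samples collapse into a single row under this approach):
$ awk -f combine.awk inputfile.txt
a10 otu1 xx 40
b24 otu2 xxx 22
x45 otu3 zz 13
z452 Otu5 rr 78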