Repetitive grabbing with awk

I am trying to compare two files and combine different columns of each. The example files are:
1.txt
chr8 12 24 . . + chr1 11 12 XX4 -
chr3 22 33 . . + chr4 60 61 XXX9 -
2.txt
chr1 11 1 X1 X2 11 12 2.443 0.843 +1 SXSD 1.3020000
chr1 11 2 X3 X4 11 12 0.888 0.833 -1 XXSD -28.887787
chr1 11 3 X5 X6 11 12 0.888 0.843 +1 XXSD 2.4909883
chr1 12 4 X7 X8 11 12 0.888 0.813 -1 CMKY 0.0009223
chr1 12 5 X9 X10 11 12 0.888 0.010 -1 XASD 0.0009223
chr1 12 6 X11 X12 11 12 0.888 0.813 -1 XUPS 0.10176998
I want to compare the 1st, 6th and 7th columns of 2.txt with the 7th, 8th and 9th columns of 1.txt, and if there is a match, I want to print the whole line of 1.txt followed by the 3rd and 12th columns of 2.txt.
The expected output is:
chr8 12 24 . . + chr1 11 12 XX4 - 1 1.3020000
chr8 12 24 . . + chr1 11 12 XX4 - 2 -28.887787
chr8 12 24 . . + chr1 11 12 XX4 - 3 2.4909883
chr8 12 24 . . + chr1 11 12 XX4 - 4 0.0009223
chr8 12 24 . . + chr1 11 12 XX4 - 5 0.0009223
chr8 12 24 . . + chr1 11 12 XX4 - 6 0.10176998
My trial is with awk:
awk 'NR==FNR{ a[$1,$6,$7]=$3"\t"$12; next } { s=SUBSEP; k=$7 s $8 s $9 }k in a{ print $0,a[k] }' 2.txt 1.txt
It outputs only the last match and I cannot make it print all matches:
chr8 12 24 . . + chr1 11 12 XX4 - 6 0.10176998
How can I repetitively search and print all matches?

You're making it much harder than it has to be by reading the 2nd file first.
$ cat tst.awk
NR==FNR { a[$7,$8,$9] = $0; next }
($1,$6,$7) in a { print a[$1,$6,$7], $3, $12 }
$ awk -f tst.awk 1.txt 2.txt
chr8 12 24 . . + chr1 11 12 XX4 - 1 1.3020000
chr8 12 24 . . + chr1 11 12 XX4 - 2 -28.887787
chr8 12 24 . . + chr1 11 12 XX4 - 3 2.4909883
chr8 12 24 . . + chr1 11 12 XX4 - 4 0.0009223
chr8 12 24 . . + chr1 11 12 XX4 - 5 0.0009223
chr8 12 24 . . + chr1 11 12 XX4 - 6 0.10176998
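As a side note (a sketch, not part of the answer above): if 1.txt could itself contain duplicate chr/start/end triples, the a[$7,$8,$9]=$0 assignment would keep only the last such line. One way around that, keeping the same file order (the script name tst2.awk is just a placeholder), is to accumulate all 1.txt lines per key, newline-separated:
$ cat tst2.awk
# store every 1.txt line under its (chr, start, end) key, newline-separated
NR==FNR { k = $7 SUBSEP $8 SUBSEP $9; a[k] = (k in a ? a[k] ORS : "") $0; next }
# for each 2.txt line, print every stored 1.txt line plus fields 3 and 12
{ k = $1 SUBSEP $6 SUBSEP $7 }
k in a { n = split(a[k], hits, ORS); for (i = 1; i <= n; i++) print hits[i], $3, $12 }
$ awk -f tst2.awk 1.txt 2.txt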

Extended AWK solution:
awk 'NR==FNR{ s=SUBSEP; k=$1 s $6 s $7; a[k]=(k in a? a[k]"#":"")$3"\t"$12; next }
{ s=SUBSEP; k=$7 s $8 s $9 }
k in a{ len=split(a[k], b, "#"); for (i=1;i<=len;i++) print $0,b[i] }' 2.txt 1.txt
s=SUBSEP; k=$1 s $6 s $7 - constructs the key k from the 1st, 6th and 7th fields of the file 2.txt
a[k]=(k in a? a[k]"#":"")$3"\t"$12 - concatenates the $3"\t"$12 pairs with the custom separator # within the same group (the group keyed by k)
s=SUBSEP; k=$7 s $8 s $9 - constructs the key k from the 7th, 8th and 9th fields of the file 1.txt
len=split(a[k], b, "#"); - splits the previously accumulated pairs into array b on the separator #
The output:
chr8 12 24 . . + chr1 11 12 XX4 - 1 1.3020000
chr8 12 24 . . + chr1 11 12 XX4 - 2 -28.887787
chr8 12 24 . . + chr1 11 12 XX4 - 3 2.4909883
chr8 12 24 . . + chr1 11 12 XX4 - 4 0.0009223
chr8 12 24 . . + chr1 11 12 XX4 - 5 0.0009223
chr8 12 24 . . + chr1 11 12 XX4 - 6 0.10176998
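If you happen to run GNU awk 4.0 or later (an assumption, not something stated in the question), the same grouping can also be written with true multidimensional arrays instead of the # concatenation and split:
awk '
NR==FNR { k = $1 SUBSEP $6 SUBSEP $7          # key built from fields 1, 6 and 7 of 2.txt
          vals[k][++cnt[k]] = $3 "\t" $12     # gawk array of arrays: one entry per 2.txt line
          next }
        { k = $7 SUBSEP $8 SUBSEP $9 }
k in vals { for (i = 1; i <= cnt[k]; i++) print $0, vals[k][i] }
' 2.txt 1.txt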

Related

How to merge two files together matching exactly by 2 columns?

I have file 1 with 5778 lines and 15 columns.
Sample from output_armlympho.txt:
NUMBER CHROM POS ID REF ALT A1 TEST OBS_CT BETA SE L95 U95 T_STAT P
42484 1 18052645 rs260514:18052645:G:A G A G ADD 1597 0.0147047 0.0656528 -0.113972 0.143382 0.223977 0.822804
42485 1 18054638 rs35592535:18054638:GC:G GC G G ADD 1597 0.0138673 0.0269643 -0.0389816 0.0667163 0.514286 0.607124
42486 7 18054785 rs1572792:18054785:G:A G A A ADD 1597 -0.0126002 0.0256229 -0.0628202
I have another file with 25958247 lines and 16 columns.
Sample from file1:
column1 column2 column3 column4 column5 column6 column7 column8 column9 column10 column11 column12 column13 column14 column15 column16
1 chr1_10000044_A_T_b38 ENS 171773 29 30 0.02 0.33 0.144 0.14 chr1 10000044 A T chr1 10060102
2 chr7_10000044_A_T_b38 ENS -58627 29 30 0.024 0.26 0.16 0.15 chr7 10000044 A T chr7 18054785
4 chr1_10000044_A_T_b38 ENS 89708 29 30 0.0 0.03 -0.0 0.038 chr1 10000044 A T chr1 18054638
5 chr1_10000044_A_T_b38 ENS -472482 29 30 0.02 0.16 0.11 0.07 chr1 10000044 A T chr1 18052645
I want to merge these files together so that the second and third columns from file 1 (CHROM, POS) exactly match the 15th and 16th columns from file 2 (column15, column16). However, a problem is that in column15 the format is chr[number], e.g. chr1, while in file 1 it is just 1. So I need a way to match 1 to chr1 or 7 to chr7, and also match by position. There may also be repeated lines in file 2, e.g. repeated values that are the same in column15 and column16. The two files are not ordered in the same way.
Expected output (all the columns from file 1 and file 2):
column1 column2 column3 column4 column5 column6 column7 column8 column9 column10 column11 column12 column13 column14 column15 column16 NUMBER CHROM POS ID REF ALT A1 TEST OBS_CT BETA SE L95 U95 T_STAT P
2 chr7_10000044_A_T_b38 ENS -58627 29 30 0.024 0.26 0.16 0.15 chr7 10000044 A T chr7 18054785 42486 7 18054785 rs1572792:18054785:G:A G A A ADD 1597 -0.0126002 0.0256229 -0.0628202
4 chr1_10000044_A_T_b38 ENS 89708 29 30 0.0 0.03 -0.0 0.038 chr1 10000044 A T chr1 18054638 42485 1 18054638 rs35592535:18054638:GC:G GC G G ADD 1597 0.0138673 0.0269643 -0.0389816 0.0667163 0.514286 0.607124
5 chr1_10000044_A_T_b38 ENS -472482 29 30 0.02 0.16 0.11 0.07 chr1 10000044 A T chr1 18052645 42484 1 18052645 rs260514:18052645:G:A G A G ADD 1597 0.0147047 0.0656528 -0.113972 0.143382 0.223977 0.822804
Current attempt:
awk 'NR==FNR {Tmp[$3] = $16 FS $4; next} ($16 in Tmp) {print $0 FS Tmp[$16]}' output_armlympho.txt file1 > test
Assumptions:
within the file output_armlympho.txt the combination of the 2nd and 3rd columns is unique
One awk idea:
awk '
FNR==1 { if (header) print $0,header; else header=$0; next }
FNR==NR { lines["chr" $2,$3]=$0; next }
($15,$16) in lines { print $0, lines[$15,$16] }
' output_armlympho.txt file1
This generates:
column1 column2 column3 column4 column5 column6 column7 column8 column9 column10 column11 column12 column13 column14 column15 column16 NUMBER CHROM POS ID REF ALT A1 TEST OBS_CT BETA SE L95 U95 T_STAT P
2 chr7_10000044_A_T_b38 ENS -58627 29 30 0.024 0.26 0.16 0.15 chr7 10000044 A T chr7 18054785 42486 7 18054785 rs1572792:18054785:G:A G A A ADD 1597 -0.0126002 0.0256229 -0.0628202
4 chr1_10000044_A_T_b38 ENS 89708 29 30 0.0 0.03 -0.0 0.038 chr1 10000044 A T chr1 18054638 42485 1 18054638 rs35592535:18054638:GC:G GC G G ADD 1597 0.0138673 0.0269643 -0.0389816 0.0667163 0.514286 0.607124
5 chr1_10000044_A_T_b38 ENS -472482 29 30 0.02 0.16 0.11 0.07 chr1 10000044 A T chr1 18052645 42484 1 18052645 rs260514:18052645:G:A G A G ADD 1597 0.0147047 0.0656528 -0.113972 0.143382 0.223977 0.822804
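If the uniqueness assumption on output_armlympho.txt ever fails, a variant of the same idea (a sketch, only checked against the samples shown above) can keep every line per key and print all of them:
awk '
FNR==1  { if (header) print $0, header; else header=$0; next }
FNR==NR { k = "chr" $2 SUBSEP $3                             # CHROM,POS with the chr prefix added
          lines[k] = (k in lines ? lines[k] ORS : "") $0     # keep every line sharing that key
          next }
        { k = $15 SUBSEP $16 }
k in lines { n = split(lines[k], m, ORS); for (i = 1; i <= n; i++) print $0, m[i] }
' output_armlympho.txt file1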

Counting number of zeros in a row, adding count to new column [closed]

I have a tab-delimited table that looks like this:
chr1 100 110 + 2 3 0 8 6
chr1 150 200 + 1 4 0 2 0
chr1 200 220 + 1 4 2 0 0
chr1 250 260 + 4 2 6 1 3
I would like to count how many zeros are in columns 5-9 and add that number to column 10:
chr1 100 110 + 2 3 0 8 6 1
chr1 150 200 + 1 4 0 2 0 2
chr1 200 220 + 1 4 2 0 0 2
chr1 250 260 + 4 2 6 1 3 0
Ultimately, the goal is to subset only those rows with no more than 4 zeros (at least 2 columns being non-zero). I know how to do this subset with awk but I don't know how to count the zeros in those columns. If there is a simpler way to just require that at least two columns be non-zero between columns 5-9 that would be ideal.
rethab's answer perfectly answers your first requirement of adding an extra column. This answers your second requirement (print only lines with fewer than 4 zeros). With awk (tested with GNU awk), simply count the non-zero fields between field 5 and field 9 (variable nz), and print only if it is greater than or equal to 2:
$ cat foo.txt
chr1 100 110 + 2 3 0 8 6
chr1 150 200 + 1 4 0 2 0
chr1 250 260 + 0 0 0 1 0
chr1 200 220 + 1 4 2 0 0
chr1 250 260 + 4 2 6 1 3
$ awk '{nz=0; for(i=5;i<=9;i++) nz+=($i!=0)} nz>=2' foo.txt
chr1 100 110 + 2 3 0 8 6
chr1 150 200 + 1 4 0 2 0
chr1 200 220 + 1 4 2 0 0
chr1 250 260 + 4 2 6 1 3
This script counts the zeros and appends the count as the last column:
awk '{
cnt=0
for (i=5;i<=9;i++) {
cnt+=($i==0)
}
print $0, cnt
}' inputs.txt
Note that $i==0 yields 1 if the condition is true and 0 otherwise, so it can be used directly as the increment for the counter.
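If you want both steps at once, the count and the filter combine into a single pass; a sketch merging the two answers above (it assumes, as in the sample, that the values of interest sit in fields 5-9):
awk '{
    cnt = 0
    for (i = 5; i <= 9; i++) cnt += ($i == 0)   # number of zero fields in columns 5-9
    if (cnt <= 3) print $0, cnt                 # keep rows with at least 2 non-zero columns
}' inputs.txt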
You can use gsub, which returns the number of substitutions it performs (here, on the string s built from fields 5 to 9), and then print that number. Note that this counts zero characters rather than zero fields, so it only works when, as in the sample, no non-zero value contains a 0 digit (a value like 10 would also be counted):
awk '{s=$5$6$7$8$9;x=gsub(/0/,"&",s);print $0, x}' file
chr1 100 110 + 2 3 0 8 6 1
chr1 150 200 + 1 4 0 2 0 2
chr1 200 220 + 1 4 2 0 0 2
chr1 250 260 + 4 2 6 1 3 0

awk cumulative sum in one dimension

Good afternoon,
I would like to make a cumulative sum for each column and line in awk.
My input file is:
1 2 3 4
2 5 6 7
2 3 6 5
1 2 1 2
And I would like, per column:
1 2 3 4
3 7 9 11
5 10 15 16
6 12 16 18
6 12 16 18
And I would like, per line:
1 3 5 9 9
2 7 13 20 20
2 5 11 16 16
1 3 4 6 6
I did the sum per column as :
awk '{ for (i=1; i<=NF; ++i) sum[i] += $i}; END { for (i in sum) printf "%s ", sum[i]; printf "\n"; }' test.txt # sum
And per line:
awk '
BEGIN {FS=OFS=" "}
{
sum=0; n=0
for(i=1;i<=NF;i++)
{sum+=$i; ++n}
print $0,"sum:"sum,"count:"n,"avg:"sum/n
}' test.txt
But I would like to print all the lines and columns.
Do you have an idea?
It looks like you have all the correct information available; all you are missing is the print statements.
Is this what you are looking for?
accumulated sum of the columns:
% cat foo
1 2 3 4
2 5 6 7
2 3 6 5
1 2 1 2
% awk '{ for (i=1; i<=NF; ++i) {sum[i]+=$i; $i=sum[i] }; print $0}' foo
1 2 3 4
3 7 9 11
5 10 15 16
6 12 16 18
accumulated sum of the rows:
% cat foo
1 2 3 4
2 5 6 7
2 3 6 5
1 2 1 2
% awk '{ sum=0; for (i=1; i<=NF; ++i) {sum+=$i; $i=sum }; print $0}' foo
1 3 6 10
2 7 13 20
2 5 11 16
1 3 4 6
Both these make use of the following :
each variable has value 0 by default (if used numerically)
I replace the field $i with the running sum value
I reprint the full line with print $0
row sums with repeated last element
$ awk '{s=0; for(i=1;i<=NF;i++) $i=s+=$i; $i=s}1' file
1 3 6 10 10
2 7 13 20 20
2 5 11 16 16
1 3 4 6 6
$i=s sets field $i (i has been incremented to NF+1 by the loop) to the sum, and the trailing 1 prints the line with that extra field.
column sums with repeated last row
$ awk '{for(i=1;i<=NF;i++) c[i]=$i+=c[i]}1; END{print}' file
1 2 3 4
3 7 9 11
5 10 15 16
6 12 16 18
6 12 16 18
END{print} repeats the last row
ps. your math seems to be wrong for the row sums
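If you need both variants from the same input, they can also be produced in one pass; a sketch writing to two files (the names row_cumsum.txt and col_cumsum.txt are just placeholders):
awk '{
    rs = 0
    for (i = 1; i <= NF; i++) {
        col[i] += $i                                  # running per-column totals
        rs     += $i                                  # running per-row total
        rowline = (i == 1 ? "" : rowline OFS) rs
        colline = (i == 1 ? "" : colline OFS) col[i]
    }
    print rowline OFS rs > "row_cumsum.txt"           # repeat the last element
    print colline        > "col_cumsum.txt"
}
END { if (NR) print colline > "col_cumsum.txt" }      # repeat the last row
' test.txt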

How to subset a tsv file based on a pattern?

I have two files. One file is a tab-separated file containing multiple columns; the other file is a list of gene names. I have to extract from file 1 only those rows whose gene is listed in file 2.
I tried the command below, but it extracts all the rows:
awk 'NR==FNR{a[$0]=1;next} {for(i in a){if($10~i){print;break}}}' File2 file1
File1:
Input line ID Chrom Position Strand Ref. base(s) Alt. base(s) Sample ID HUGO symbol Sequence ontology Protein sequen
3 VAR113_NM-02_TUMOR_DNA chr1 11082255 + G T NM-02_TUMOR_DNA TARDBP MS K263N . PASS het 3 25
4 VAR114_NM-02_TUMOR_DNA chr1 15545868 + G T NM-02_TUMOR_DNA TMEM51 MS V131F . PASS het 3 13
6 VAR116_NM-02_TUMOR_DNA chr1 20676680 + C T NM-02_TUMOR_DNA VWA5B1 SY S970S . PASS het 4 34
7 rs149021429_NM-02_TUMOR_DNA chr1 21554495 + C A NM-02_TUMOR_DNA ECE1 SY S570S . PASS het 3
16 VAR126_NM-02_TUMOR_DNA chr1 39905109 + C T NM-02_TUMOR_DNA MACF1 SY V4069V . PASS het 4 17
21 VAR131_NM-02_TUMOR_DNA chr1 101387378 + G T NM-02_TUMOR_DNA SLC30A7 MS A275S . PASS het 4 45
24 VAR134_NM-02_TUMOR_DNA chr1 113256156 + C A NM-02_TUMOR_DNA PPM1J MS S135I . PASS het 3 9
25 rs201097299_NM-02_TUMOR_DNA chr1 145326106 + A T NM-02_TUMOR_DNA NBPF10 MS M1327L . PASS het 5
26 VAR136_NM-02_TUMOR_DNA chr1 149859281 + T C NM-02_TUMOR_DNA HIST2H2AB SY E62E . PASS het 11
27 VAR137_NM-02_TUMOR_DNA chr1 150529801 + C A NM-02_TUMOR_DNA ADAMTSL4 SY S679S . PASS het 3
28 rs376491237_NM-02_TUMOR_DNA chr1 150532649 + C A NM-02_TUMOR_DNA ADAMTSL4 SY R1068R . PASS het
34 VAR144_NM-02_TUMOR_DNA chr1 160389277 + T A NM-02_TUMOR_DNA VANGL2 SY L226L . PASS het 3 6
35 VAR145_NM-02_TUMOR_DNA chr1 161012389 + C A NM-02_TUMOR_DNA USF1 MS D44Y . PASS het 3 32
37 VAR147_NM-02_TUMOR_DNA chr1 200954042 + G T NM-02_TUMOR_DNA KIF21B MS R1250S . PASS het 3 21
41 rs191896925_NM-02_TUMOR_DNA chr1 207760805 + G T NM-02_TUMOR_DNA CR1 MS G1869W . PASS het 3
42 VAR152_NM-02_TUMOR_DNA chr1 208218427 + C A NM-02_TUMOR_DNA PLXNA2 SY G1208G . PASS het 3 13
43 VAR153_NM-02_TUMOR_DNA chr1 222715425 + A G NM-02_TUMOR_DNA HHIPL2 SY Y349Y . PASS het 10 41
44 VAR154_NM-02_TUMOR_DNA chr1 222715452 + T A NM-02_TUMOR_DNA HHIPL2 SY G340G . PASS het 5 46
45 VAR155_NM-02_TUMOR_DNA chr1 223568296 + G A NM-02_TUMOR_DNA C1orf65 SY G493G . PASS het 3 25
48 VAR158_NM-02_TUMOR_DNA chr2 8931258 + G A NM-02_TUMOR_DNA KIDINS220 MS P458L . PASS het 3 13
51 VAR161_NM-02_TUMOR_DNA chr2 37229656 + C A NM-02_TUMOR_DNA HEATR5B MS G1704C . PASS het 4 9
60 VAR170_NM-02_TUMOR_DNA chr2 84775506 + G T NM-02_TUMOR_DNA DNAH6 MS Q427H . PASS het 3 20
63 VAR173_NM-02_TUMOR_DNA chr2 86378563 + C A NM-02_TUMOR_DNA IMMT MS A420S . PASS het 6 29
64 VAR174_NM-02_TUMOR_DNA chr2 86716546 + G T NM-02_TUMOR_DNA KDM3A MS C1140F . PASS het 3 18
65 VAR175_NM-02_TUMOR_DNA chr2 96852612 + C A NM-02_TUMOR_DNA STARD7 SY L323L . PASS het 2 2
67 VAR177_NM-02_TUMOR_DNA chr2 121747740 + C A NM-02_TUMOR_DNA GLI2 MS P1417H . PASS het 2 2
71 rs199770435_NM-02_TUMOR_DNA chr2 130872871 + C T NM-02_TUMOR_DNA POTEF SY G184G . PASS het 8
72 rs199695856_NM-02_TUMOR_DNA chr2 132919171 + A G NM-02_TUMOR_DNA ANKRD30BL SY H36H . PASS het
73 rs111295191_NM-02_TUMOR_DNA chr2 132919192 + G A NM-02_TUMOR_DNA ANKRD30BL SY N29N . PASS het
76 VAR186_NM-02_TUMOR_DNA chr2 167084231 + T A NM-02_TUMOR_DNA SCN9A SY A1392A . PASS het 3 19
77 VAR187_NM-02_TUMOR_DNA chr2 168100115 + C G NM-02_TUMOR_DNA XIRP2 MS T738S . PASS het 9 49
78 VAR188_NM-02_TUMOR_DNA chr2 179343033 + G T NM-02_TUMOR_DNA FKBP7 MS A65D . PASS het 3 7
79 VAR189_NM-02_TUMOR_DNA chr2 179544108 + G C NM-02_TUMOR_DNA TTN MS P11234A . PASS het 3 17
82 VAR192_NM-02_TUMOR_DNA chr2 220074164 + G T NM-02_TUMOR_DNA ZFAND2B MS E92D . PASS het 2 2
83 VAR193_NM-02_TUMOR_DNA chr2 220420892 + C A NM-02_TUMOR_DNA OBSL1 MS G1487W . PASS het 3 9
84 rs191578275_NM-02_TUMOR_DNA chr2 233273263 + C A NM-02_TUMOR_DNA ALPPL2 MS P279Q . PASS het 3
86 VAR196_NM-02_TUMOR_DNA chr2 241815391 + G T NM-02_TUMOR_DNA AGXT SY L272L . PASS het 3 10
88 VAR198_NM-02_TUMOR_DNA chr3 9484995 + C T NM-02_TUMOR_DNA SETD5 SG R361* . PASS het 3 18
96 VAR206_NM-02_TUMOR_DNA chr3 49848502 + G T NM-02_TUMOR_DNA UBA7 MS P382H . PASS het 5 38
102 VAR212_NM-02_TUMOR_DNA chr3 58302669 + G T NM-02_TUMOR_DNA RPP14 MS L89F . PASS het 3 30
103 VAR213_NM-02_TUMOR_DNA chr3 63981750 + C A NM-02_TUMOR_DNA ATXN7 MS T751K . PASS het 3 13
104 rs146577101_NM-02_TUMOR_DNA chr3 97868656 + C T NM-02_TUMOR_DNA OR5H14 MS R143W . PASS het 4
107 rs58176285_NM-02_TUMOR_DNA chr3 123419183 + G A NM-02_TUMOR_DNA MYLK SY A1044A . PASS het 18
108 VAR218_NM-02_TUMOR_DNA chr3 123419189 + C T NM-02_TUMOR_DNA MYLK SY K1042K . PASS het 23 174
115 VAR225_NM-02_TUMOR_DNA chr3 183753779 + C A NM-02_TUMOR_DNA HTR3D MS P91T . PASS het 4 48
File2:
FBN1
HELZ
RALGPS2
DYNC1I2
NFE2L2
POSTN
INO80
I want those rows which contain these genes.
So if I am following you correctly, you just want to search $9 in file1 using the genes in file2. If I add MYLK to the list, I get:
Maybe:
awk 'NR==FNR{A[$1];next}$9 in A' file2 file1
**empty line** (since `MYLK` was added after a line break, file2 now contains a blank line; that blank line is stored as a key, so blank records match and are included)
107 rs58176285_NM-02_TUMOR_DNA chr3 123419183 + G A NM-02_TUMOR_DNA MYLK SY A1044A . PASS het 18
108 VAR218_NM-02_TUMOR_DNA chr3 123419189 + C T NM-02_TUMOR_DNA MYLK SY K1042K . PASS het 23 174
To remove the blank line from the output:
awk 'NR==FNR{A[$1];next}$9 in A' file2 file1 | awk '!/^$/'
107 rs58176285_NM-02_TUMOR_DNA chr3 123419183 + G A NM-02_TUMOR_DNA MYLK SY A1044A . PASS het 18
108 VAR218_NM-02_TUMOR_DNA chr3 123419189 + C T NM-02_TUMOR_DNA MYLK SY K1042K . PASS het 23 174
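If the gene symbol ever sits in a different column (the data above has it in $9, while your own attempt searched $10), the field number can be passed in as a variable rather than hard-coded; a sketch:
awk -v col=9 '                    # col: 1-based number of the field holding the gene symbol
NR==FNR { genes[$1]; next }       # remember every gene name from file2
$col in genes                     # print file1 lines whose chosen field is a listed gene
' file2 file1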

I want to match two files based on different fields in the two files using awk?

Hi, I have a big data set and I want to match two files based on $5 from file 1 and $1 or $3 of file 2, and print the lines of file 1 which match file 2. In addition, I want to append $5 and $6 of file 2 to the matching lines of file 1.
file 1
7 81 1 47 32070
7 83 1 67 29446
7 92 1 84 28234
file 2
32070 0 0 19360101 HF 8 0 M C
28234 0 0 19350101 HF 8 0 M C
124332 0 0 19340101 HF 8 0 M C
29446 0 0 19340101 HF 8 0 M C
I would like the output to look like this:
7 81 1 47 32070 HF 8
7 83 1 67 29446 HF 8
7 92 1 84 28234 HF 8
This awk one-liner should do the job:
awk 'NR==FNR{a[$1]=$5 FS $6;next}$0=$0 FS a[$NF]' f2 f1
Giving it a test on your example input files:
kent$ awk 'NR==FNR{a[$1]=$5 FS $6;next}$0=$0 FS a[$NF]' f2 f1
7 81 1 47 32070 HF 8
7 83 1 67 29446 HF 8
7 92 1 84 28234 HF 8
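The question also mentioned matching on $1 or $3 of file 2, while the one-liner keys on $1 only. If $3 can genuinely carry the ID as well (in the sample it is always 0, so it makes no difference there), both columns can be indexed; a sketch that also prints only the matching lines:
awk '
NR==FNR { a[$1] = $5 FS $6        # index file 2 by its first column ...
          a[$3] = $5 FS $6        # ... and by its third column
          next }
$5 in a { print $0, a[$5] }       # append file 2 fields 5 and 6 to each matching file 1 line
' f2 f1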