Process multiple files with awk

I would like to count the number of points in each interval. I have the positions of the points in the first file and the intervals in the second. First I store the point attributes in two arrays (pos and name), and then I want to loop over them to determine whether each point belongs to the given interval ($1 is the name, $2 the start, and $3 the end of the interval). I have the following code:
awk 'NR==FNR{name[NR]=$1;pos[NR]=$2;next}; {for (i in name) if (name[i] == $1 && pos[i] >= $2 && pos[i] <= $3) {sum[NR] += 1;}} END {for (i = 1; i <=length(sum); i++) {print sum[i]}} file1 file2 > out'
I have a syntax error: "syntax error near unexpected token `i"
I am a beginner in awk. Any help is highly appreciated. Thanks
awk '
NR==FNR{
name[NR]=$1
pos[NR]=$2
next
}
{
for(i in name){
if(name[i] == $1 && pos[i] >= $2 && pos[i] <= $3){ sum[FNR] += 1; }
}
}
END {
for(i = 1; i <=FNR; i++){
print sum[i];
}
}
' points windows > output
points:
chr1 52
chr1 65
chr2 120
chr2 101
chr2 160
chr3 20
chr4 50
windows:
chr1 0 100
chr1 100 200
chr2 0 100
chr2 100 200
chr3 0 100
chr3 100 200
chr4 0 100
chr5 0 100
chr6 0 100
chr6 100 200
chr7 0 100
chr8 0 100
gave me the desired output:
2
3
1
1
Thank You

Your closing ' is in the wrong place, so the awk command is not terminated properly. Could you please try the following? I couldn't test it since no samples were given.
awk 'NR==FNR{name[NR]=$1;pos[NR]=$2;next}; {for (i in name) if (name[i] == $1 && pos[i] >= $2 && pos[i] <= $3) {sum[NR] += 1;}} END {for (i = 1; i <=length(sum); i++) {print sum[i]}}' file1 file2
The same solution in multi-line form:
awk '
NR==FNR{
name[NR]=$1
pos[NR]=$2
next
}
{
for(i in name){
if(name[i] == $1 && pos[i] >= $2 && pos[i] <= $3){ sum[NR] += 1 }
}
}
END{
for(i = 1; i <=length(sum); i++){
print sum[i]
}
}
' file1 file2 > out
Following Ed Morton's comment, here are a few recommendations. Again, these are untested since no samples were given, but you could try applying them:
sum[NR] should be sum[FNR] if you want the index to be the line number within Input_file2. The difference is that NR keeps increasing until all input files have been read, whereas FNR is reset to 1 each time a new input file starts being read.
Likewise, length(sum) could be replaced by FNR, since what you need there is the number of loop iterations, i.e. the number of lines in the second file.
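The NR/FNR distinction can be sketched with a few rows in the points/windows layout from the question. This is only an illustration under assumptions: the temporary file names and the variable nwin are made up for the example, and sum[i] + 0 is used so that a window with no points prints 0 instead of an empty line.

```shell
# Sample rows in the layout of the question's two files.
tmp=$(mktemp -d)
printf '%s\n' 'chr1 52' 'chr1 65' 'chr2 120' > "$tmp/points"
printf '%s\n' 'chr1 0 100' 'chr1 100 200' 'chr2 100 200' > "$tmp/windows"

counts=$(awk '
NR == FNR {            # true only while the first file is being read
    name[NR] = $1
    pos[NR]  = $2
    next
}
{                      # second file: FNR restarts at 1, NR does not
    for (i in name)
        if (name[i] == $1 && pos[i] >= $2 && pos[i] <= $3)
            sum[FNR]++
    nwin = FNR         # hypothetical helper: number of windows read so far
}
END {
    for (i = 1; i <= nwin; i++)
        print sum[i] + 0    # + 0 turns an unset count into 0
}
' "$tmp/points" "$tmp/windows")

echo "$counts"
rm -rf "$tmp"
```

Indexing by FNR makes the first window land in sum[1] regardless of how many points were read before it.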

Related

Counting the number of unique values based on more than two columns in bash

I need to modify the code below, from "Counting the number of unique values based on two columns in bash", to work on more than two columns.
awk ' ##Starting awk program from here.
BEGIN{
FS=OFS="\t"
}
!found[$0]++{ ##Checking condition if 1st and 2nd column is NOT present in found array then do following.
val[$1]++ ##Creating val with 1st column as index and increasing its value here.
}
END{ ##Starting END block of this program from here.
for(i in val){ ##Traversing through array val here.
print i,val[i] ##Printing i and value of val with index i here.
}
}
' Input_file ##Mentioning Input_file name here.
Table to count how many of each double (all DIS)
patient sex DISa DISb DISc DISd DISe DISf DISg DISh DISi
patient1 male 550.1 550.5 594.1 594.3 594.8 591 1019 960.1 550.1
patient2 female 041 208 250.2 276.14 426.32 550.1 550.5 558 041
patient3 female NA NA NA NA NA NA NA 041 NA
The output I need is:
550.1 3
550.5 2
594.1 1
594.3 1
594.8 1
591 1
1019 1
960.1 1
550.1 1
041 3
208 1
250.2 1
276.14 1
426.32 1
558 1

Consider this awk:
awk -v OFS='\t' 'NR > 1 {for (i=3; i<=NF; ++i) if ($i+0 == $i) ++fq[$i]} END {for (i in fq) print i, fq[i]}' file
276.14 1
960.1 1
594.3 1
426.32 1
208 1
041 3
594.8 1
550.1 3
591 1
1019 1
558 1
550.5 2
250.2 1
594.1 1
A more readable form:
awk -v OFS='\t' '
NR > 1 {
for (i=3; i<=NF; ++i)
if ($i+0 == $i)
++fq[$i]
}
END {
for (i in fq)
print i, fq[i]
}' file
The test $i+0 == $i checks that the column value is numeric.
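A small sketch of how that comparison behaves on values like those in the table (the labels "numeric" and "text" are just for illustration). A field that looks like a number is compared numerically, while a field such as NA is compared as a string against "0" and fails. Note that strings in exponent form such as 1e3 also look numeric to awk.

```shell
result=$(printf '%s\n' '041' 'NA' '550.1' 'male' '1019' |
    awk '{ print $1, (($1 + 0 == $1) ? "numeric" : "text") }')
echo "$result"
```

Because the array key is the field text itself, a code such as 041 keeps its leading zero when it is later printed.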

If the ordering must be preserved, then you need an additional array b[] to keep the order each number is encountered, e.g.
awk '
BEGIN { OFS = "\t" }
FNR > 1 {
for (i=3;i<=NF;i++)
if ($i~/^[0-9]/) {
if (!($i in a))
b[++n] = $i;
a[$i]++
}
}
END {
for (i=1;i<=n;i++)
print b[i], a[b[i]]
}' file
Example Use/Output
$ awk '
> BEGIN { OFS = "\t" }
> FNR > 1 {
> for (i=3;i<=NF;i++)
> if ($i~/^[0-9]/) {
> if (!($i in a))
> b[++n] = $i;
> a[$i]++
> }
> }
> END {
> for (i=1;i<=n;i++)
> print b[i], a[b[i]]
> }' patients
550.1 3
550.5 2
594.1 1
594.3 1
594.8 1
591 1
1019 1
960.1 1
041 3
208 1
250.2 1
276.14 1
426.32 1
558 1
Let me know if you have further questions.

With all respect to the two answers above (by anubhava and David), here are their complete solutions with one small tweak: the numeric test is replaced by int($i), which is true when the integer part of the value is nonzero and therefore skips NA in the shown samples (note that a value such as 0.5 would also be skipped). Written and tested with the shown samples only.
1st solution: If order doesn't matter in output try:
awk -v OFS='\t' '
NR > 1 {
for (i=3; i<=NF; ++i)
if (int($i))
++fq[$i]
}
END {
for (i in fq)
print i, fq[i]
}' Input_file
2nd solution: If order matters based on David's answer try.
awk '
BEGIN { OFS = "\t" }
FNR > 1 {
for (i=3;i<=NF;i++)
if (int($i)) {
if (!($i in a))
b[++n] = $i;
a[$i]++
}
}
END {
for (i=1;i<=n;i++)
print b[i], a[b[i]]
}' Input_file

Using GNU awk for multi-char RS:
$ awk -v RS='[[:space:]]+' '$0+0 == $0' file | sort | uniq -c
3 041
1 1019
1 208
1 250.2
1 276.14
1 426.32
3 550.1
2 550.5
1 558
1 591
1 594.1
1 594.3
1 594.8
1 960.1
If the order of fields really matters just pipe the above to awk '{print $2, $1}'.

splitting data under specific value after the sum of it

How do I split input.txt into separate output files so that the sum of the size column ($1) in each file stays under 80000?
input.txt
11736 textDYN.txt
65736 textMV.txt
61812 textDYN_1.txt
11750 textGB.txt
$1 < 80000
total sum = 77472 under 80000 then, it will be output as output_001.txt
11736 textDYN.txt
65736 textMV.txt
total sum = 73562 under 80000 then, it will be output as output_002.txt
61812 textDYN_1.txt
11750 textGB.txt

$ cat tst.awk
{
sub(/\r$/,"")
sum += $1
}
(NR == 1) || (sum > 80000) {
close(out)
out = sprintf("output_%03d.txt",++cnt)
sum = $1
}
{ print $0 " > " out }
$ awk -f tst.awk file
11736 textDYN.txt > output_001.txt
65736 textMV.txt > output_001.txt
61812 textDYN_1.txt > output_002.txt
11750 textGB.txt > output_002.txt
Change print $0 " > " out to print > out when done testing.

If I understand how you want to split your files, you would need to handle the case where the sum of $1 is less than 80000 (by adding to sum), the case where it is more than 80000 (updating to output to a new filename and resetting the sum) and a third rule that writes to the current filename.
You could do something like:
awk 'BEGIN {sum=0; cnt=1; fn=sprintf("output_%03d.txt", cnt++)}
sum < 80000 {sum+=$1}
sum >= 80000 {fn=sprintf("output_%03d.txt", cnt++); sum=$1}
{print $0 > fn}
' file
An example using your input data in file would be:
$ awk 'BEGIN {sum=0; cnt=1; fn=sprintf("output_%03d.txt", cnt++)}
> sum < 80000 {sum+=$1}
> sum >= 80000 {fn=sprintf("output_%03d.txt", cnt++); sum=$1}
> {print $0 > fn}
> ' file
Resulting Files
$ cat output_001.txt
11736 textDYN.txt
65736 textMV.txt
$ cat output_002.txt
61812 textDYN_1.txt
11750 textGB.txt
If that isn't what you were attempting, please let me know and I'm happy to help further.

awk to compare value of sub-string in field

In the awk below I am trying to extract and compare each substring in $4 that starts with p.. If the first three letters are the same as the last three (there are digits in between), then that p. is updated to p.(3 letters)(digits)(=); the () are only there to show the three parts and are not needed. If the three-letter codes differ, the line is left unchanged. Line 1 of the file below is an example. My actual data has about 10,000 rows with about 50 columns, but $4 is the only one that will contain these p. values. The format of the p. will always be three letters, followed by a 1-4 digit number, followed by three more letters. I think the awk attempt below will extract each p. and split on the ;, but I am not sure how to compare the substrings to check whether the three letters are the same. Thank you :).
file tab-delimited
Chr Start ExonicFunc.refGene AAChange.refGene
chr1 155880573 synonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu110Glu;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110Glu
chr1 155880573 nonsynonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln
desired output tab-delimited
Chr Start ExonicFunc.refGene AAChange.refGene
chr1 155880573 synonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu110=;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110=
chr1 155880573 nonsynonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln
awk
awk '
BEGIN { OFS="\t" }
$4 ~ /:NM/ {
ostring=""
# split $4 by ";" and cycle through them
nNM=split($4,NM,";")
for (n=1; n<=nNM; n++) {
if (n>1) ostring=(ostring ";") # append ";"
if (match(NM[n],/p[.].*/)) {
# copy up to "p."
ostring=(ostring substr(NM[n],1,RSTART+1))
# Get the substring after "p."
VAL=substr(NM[n],RSTART+2)
# Get its length
lenVAL=length(VAL)
# store aa array
aa=[{while(length($4)=3){print substr($044,1,3);gsub(/^./,"")}]}' file

Extended GNU awk solution:
awk 'NR==1; NR > 1{
len = split($4, a, /\<p\.[a-zA-Z]{3}[0-9]+[a-zA-Z]{3}\>/, seps);
if (len == 1){ print; next }
res = ""
for (i=1; i < len; i++) {
s = seps[i];
if (substr(s, 3, 3) == substr(s, length(s) - 2)) {
seps[i] = substr(s, 1, length(s) - 3)"=";
}
}
for (i=1; i <= len; i++)
res = res a[i] (seps[i]? seps[i]:"");
$4 = res; print
}' FS='\t' OFS='\t' file
The output:
Chr Start ExonicFunc.refGene AAChange.refGene
chr1 155880573 synonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu110=;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110=
chr1 155880573 nonsynonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln
Time performance measurement:
Input testfile:
$ wc -l testfile
10000 testfile
time(awk 'NR==1; NR > 1{
len = split($4, a, /\<p\.[a-zA-Z]{3}[0-9]+[a-zA-Z]{3}\>/, seps);
if (len == 1){ print; next }
res = ""
for (i=1; i < len; i++) {
s = seps[i];
if (substr(s, 3, 3) == substr(s, length(s) - 2)) {
seps[i] = substr(s, 1, length(s) - 3)"=";
}
}
for (i=1; i <= len; i++)
res = res a[i] (seps[i]? seps[i]:"");
$4 = res; print
}' FS='\t' OFS='\t' testfile >/dev/null)
real 0m0.269s
user 0m0.256s
sys 0m0.000s
time(awk 'BEGIN { FS=OFS="\t" }
NR>1 {
head = ""
tail = $4
while ( match(tail,/(p\.([[:alpha:]]{3})[0-9]+)([[:alpha:]]{3})/,a) ) {
head = head substr(tail,1,RSTART-1) a[1] (a[2] == a[3] ? "=" : a[3])
tail = substr(tail,RSTART+RLENGTH)
}
$4 = head tail
}
{ print }' testfile >/dev/null)
real 0m0.470s
user 0m0.416s
sys 0m0.008s

With GNU awk for the 3rd arg to match():
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR>1 {
head = ""
tail = $4
while ( match(tail,/(p\.([[:alpha:]]{3})[0-9]+)([[:alpha:]]{3})/,a) ) {
head = head substr(tail,1,RSTART-1) a[1] (a[2] == a[3] ? "=" : a[3])
tail = substr(tail,RSTART+RLENGTH)
}
$4 = head tail
}
{ print }
$ gawk -f tst.awk file
Chr Start ExonicFunc.refGene AAChange.refGene
chr1 155880573 synonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu110=;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110=
chr1 155880573 nonsynonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln
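Both answers depend on gawk extensions (the four-argument split() and the three-argument match()). As a hedged sketch of the same idea for other awks, the loop below uses only two-argument match() with RSTART and RLENGTH, and spells out the three letters instead of using {3}, which some older awks do not support. The sample rows are the two data lines from the question without the header, and only the rewritten $4 is printed.

```shell
fixed=$(
  {
    printf 'chr1\t155880573\tsynonymous SNV\tRIT1:NM_001256821:exon2:c.31G>C:p.Glu110Glu;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110Glu\n'
    printf 'chr1\t155880573\tnonsynonymous SNV\tRIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln\n'
  } | awk '
BEGIN { FS = OFS = "\t" }
{
    head = ""
    tail = $4
    # Find each p.<3 letters><digits><3 letters> token from left to right.
    while (match(tail, /p\.[A-Za-z][A-Za-z][A-Za-z][0-9]+[A-Za-z][A-Za-z][A-Za-z]/)) {
        m = substr(tail, RSTART, RLENGTH)
        # Same residue on both sides: replace the trailing 3 letters with "=".
        if (substr(m, 3, 3) == substr(m, RLENGTH - 2))
            m = substr(m, 1, RLENGTH - 3) "="
        head = head substr(tail, 1, RSTART - 1) m
        tail = substr(tail, RSTART + RLENGTH)
    }
    $4 = head tail
    print $4
}'
)
echo "$fixed"
```

To process the real file, the header would need to be passed through unchanged (for example with an NR > 1 condition on the main rule) and the whole record printed instead of $4.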

awk to update value in field of out file using contents of another

In the out.txt below I am trying to use awk to update the contents of $9. The out.txt is created by the awk before the pipe |. If $9 contains a + or -, then $8 of out.txt is used as a key to look up $2 of file2. When a match is found (there will always be one), the $3 value of that file2 line is appended to $9 of out.txt, separated by a :. So the original +6 in out.txt would become +6:NM_005101.3. The awk below is close but has syntax errors after the | that I cannot seem to fix. Thank you :).
out.txt tab-delimited
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 -0 . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . .
4 chr1 949925 949925 C T downstream ISG15 +6 . . .
5 chr1 207646923 207646923 G A intronic CR2 >50 . . .
8 chr1 948840 948840 - C upstream ISG15 -6 . . .
file2 space-delimited
2 ISG15 NM_005101.3 948846-948956 949363-949919
desired output tab-delimited
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 -0:NM_005101.3 . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . .
4 chr1 949925 949925 C T downstream ISG15 +6:NM_005101.3 . . .
5 chr1 207646923 207646923 G A intronic CR2 >50 . . .
8 chr1 948840 948840 - C upstream ISG15 -6:NM_005101.3 . . .
Description
lines 1, 3, 5: `$9` updated with `:` and the value of `$3` in `file2`
lines 2 and 4 are skipped as these do not have a `+` or `-` in them
awk
awk -v extra=50 -v OFS='\t' '
NR == FNR {
count[$2] = $1
for(i = 1; i <= $1; i++) {
low[$2, i] = $(2 + 2 * i)
high[$2, i] = $(3 + 2 * i)
mid[$2, i] = (low[$2, i] + high[$2, i]) / 2
}
next
}
FNR != 1 && $9 == "." && $12 == "." && $8 in count {
for(i = 1; i <= count[$8]; i++)
if($4 >= (low[$8, i] - extra) && $4 <= (high[$8, i] + extra)) {
if($4 > mid[$8, i]) {
sign = "+"
value = high[$8, i]
}
else {
sign = "-"
value = low[$8, i]
}
diff = (value > $4) ? value - $4 : $4 - value
$9 = (diff > 50) ? ">50" : (sign diff)
break
}
if(i > count[$8]) {
$9 = ">50"
}
}
1
' FS='[- ]' file2 FS='\t' file1 | awk if($6 == "-" || $6 == "+") printf ":" ; 'FNR==NR {a[$2]=$3; next} a[$8]{$3=a[$8]}1' OFS='\t' file2 > final.txt
bash: syntax error near unexpected token `('

As far as I can tell, your awk code is OK and your bash usage is wrong.
FS='[- ]' file2 FS='\t' file1 |
awk if($6 == "-" || $6 == "+")
printf ":" ;
'FNR==NR {a[$2]=$3; next}
a[$8]{$3=a[$8]}1' OFS='\t' file2 > final.txt
bash: syntax error near unexpected token `('
I don't know what the second awk command is supposed to do, but one thing is certain: its awk code needs to be quoted (awk 'if(....). The bash error arises because bash is interpreting the unquoted awk code itself, and ( is not a valid shell-script token after if. Note also that in awk an if statement must sit inside an action block { ... }; it cannot appear bare at the top level of the program.
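For what the question describes (append : and file2's $3 to $9 when $9 is a signed distance), a minimal sketch of the second stage might look like the following. It is an assumption about the intent, not the asker's code: the array name nm and the temporary files are hypothetical, the pattern ^[+-][0-9]+$ is one way to read "contains a + or -" so that values such as NM_005101.3:c.-84C>G are left alone, and only three sample rows without the header are used.

```shell
tmp=$(mktemp -d)
# file2 is space-delimited: count, gene, transcript, exon ranges.
printf '%s\n' '2 ISG15 NM_005101.3 948846-948956 949363-949919' > "$tmp/file2"
# Three tab-delimited rows in the layout of out.txt.
{
  printf '1\tchr1\t948846\t948846\t-\tA\tupstream\tISG15\t-0\t.\t.\t.\n'
  printf '2\tchr1\t948870\t948870\tC\tG\tUTR5\tISG15\tNM_005101.3:c.-84C>G\t.\t.\n'
  printf '5\tchr1\t207646923\t207646923\tG\tA\tintronic\tCR2\t>50\t.\t.\t.\n'
} > "$tmp/out.txt"

updated=$(awk '
NR == FNR { nm[$2] = $3; next }          # file2: gene -> transcript
$9 ~ /^[+-][0-9]+$/ && ($8 in nm) {      # only signed distances such as -0 or +6
    $9 = $9 ":" nm[$8]
}
{ print }
' "$tmp/file2" FS='\t' OFS='\t' "$tmp/out.txt" | cut -f9)

echo "$updated"
rm -rf "$tmp"
```

The FS and OFS assignments are placed between the two file names so that file2 is still split on whitespace while out.txt is read and written with tabs. In the original pipeline, out.txt would be replaced by - to read the first awk's output from standard input.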

awk setting variables to make a range

I have the following two files:
File 1:
1 290 rs1345
2 450 rs5313
1 1120 rs4523
2 790 rs4325
File 2:
1 201 LDLR
2 714 APOA5
1 818 NOTCH5
1 514 TTN
I wish to isolate only the rows in file 2 in which the second field is within 100 units of the second field in file 1 (if field 1 matches):
Desired output: (note the third field is from the matching line in file1).
1 201 LDLR rs1345
2 714 APOA5 rs4325
I tried using the following code:
for i in {1..4} #there are 4 lines in file2
do
chr=$(awk 'NR=="'${i}'" { print $1 }' file2)
pos=$(awk 'NR=="'${i}'" { print $2 }' file2)
gene=$(awk 'NR=="'${i}'" { print $3 }' file2)
start=$(echo $pos | awk '{print $1-100}') #start and end variables for 100 unit range
end=$(echo $pos | awk '{print $1+100}')
awk '{if ($1=="'$chr'" && $2 > "'$start'" && $2 < "'$end'") print "'$chr'","'$pos'","'$gene'"$3}' file1
done
The code is not working. I believe something is wrong with my start and end variables, because when I echo $start I get 414, which doesn't make sense to me, and I get 614 when I echo $end.
I understand this question might be difficult to understand so please ask me if any clarification is necessary.
Thank you.

The difficulty is that $1 is not a unique key, so some care needs to be taken with the data structure to store the data in file 1.
With GNU awk, you can use arrays of arrays:
gawk '
NR==FNR {f1[$1][$2] = $3; next}
$1 in f1 {
for (val in f1[$1])
if (val-100 <= $2 && $2 <= val+100)
print $0, f1[$1][val]
}
' file1 file2
Otherwise, you have to use a one-dimensional array and stuff 2 pieces of information into the key:
awk '
NR==FNR {f1[$1,$2] = $3; next}
{
for (key in f1) {
split(key, a, SUBSEP)
if (a[1] == $1 && a[2]-100 <= $2 && $2 <= a[2]+100)
print $0, f1[key]
}
}
' file1 file2
That works with mawk and nawk (and gawk)

#!/usr/bin/env python3
from io import StringIO

import pandas as pd

file1 = """
1 290 rs1345
2 450 rs5313
1 1120 rs4523
2 790 rs4325
"""
file2 = """
1 201 LDLR
2 714 APOA5
1 818 NOTCH5
1 514 TTN
"""
# Read both space-delimited tables; they have no header, so name the columns.
df1 = pd.read_csv(StringIO(file1), sep=" ", header=None)
df1.columns = ["a", "b", "c"]
df2 = pd.read_csv(StringIO(file2), sep=" ", header=None)
df2.columns = ["a", "b", "c"]
# Pair every file2 row with every file1 row that shares column a.
df = pd.merge(df2, df1, left_on="a", right_on="a", how="outer")
# Keep pairs whose positions are within 100 units (b_x: file2, b_y: file1).
r = df.query("b_y - 100 < b_x < b_y + 100")
print(r[["a", "b_x", "c_x", "c_y"]])
output:
a b_x c_x c_y
0 1 201 LDLR rs1345
7 2 714 APOA5 rs4325
pandas is the right tool to do such tabular data manipulation.
