awk to update a file with the sum of matching fields from another file

In the awk below I am trying to apply a penalty to the score of each matching $1 in file2, based on the sum of $3+$4 (variable TL) in file1. The $4 value in file1 is then divided by TL and multiplied by 100 (this value is variable S). Finally, $2 in file2 minus S gives the updated $2 in file2. Since math is not my strong suit there is probably a better way of doing this, but this is what I could think of. Thank you :).
file1 space delimited
ACP5 4 1058 0
ACTB5 10 1708 79
ORAI1 2 952 0
TBX1 9 1932 300
file2 tab-delimited
ACP5 100.00
ACTB 100.00
ORAI1 94.01
TBX1 77.23
desired output tab-delimited the --- is an example calculation and not part of the output
ACP5 100.00
ACTB 89.59 ---- $3+$4=1787 this is TL (comes from file1), $4/TL*100 is 4.42, $2 in file2 is 100 - 4.42 = 95.58 ----
ORAI1 94.01
TBX1 63.79
awk
awk '
FNR==NR{ # process each line
TL[$1]=($3+$4);next} ($1 in TL) # from file1 store sum of $3 and $4 in TL
{S=(P[$4]/TL)*100;printf("%s\t %.2f\n",$1, $2-S) # store $4/TL from file1 in S and subtract S from $2 in file2, output two decimal places
}1' OFS="\t" file1 FS="\t" file2 # update and define input
current output
ACP5 100.00
ACTB 100.00
ORAI1 94.01
TBX1 77.23

As pointed out in the comments, the question is not completely clear. Since I can't comment yet I will give a solution that calculates the values as requested.
awk '
NF==4 { S[$1] = 100 * $4 / ($3 + $4) }
NF==2 { printf("%s\t%.2f\n", $1, $2 - S[$1]) }
' file1 file2
file1
ACP5 4 1058 0
ACTB 10 1708 79
ORAI1 2 952 0
TBX1 9 1932 300
file2
ACP5 100.00
ACTB 100.00
ORAI1 94.01
TBX1 77.23
output
ACP5 100.00
ACTB 95.58
ORAI1 94.01
TBX1 63.79
Explanation:
The script calculates the S value and stores it in an associative array, using $1 as the key. This happens in a block filtered by NF==4, so it only runs for the first file (the only one with 4 fields). Finally, for NF==2, which identifies the second file, the result is printed with printf after subtracting the corresponding S value from $2.
Observation: Keep in mind that, as @kvantour pointed out, the example you provided does not follow the description in the question. For example, where did the 89.59 value come from? The explanation ends up with 95.58 as the result, just like the output of the script I provided.
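The arithmetic for the ACTB row can be checked by hand: TL = 1708 + 79 = 1787, S = 100 * 79 / 1787, which is about 4.42, and 100.00 - 4.42 = 95.58. A slightly more defensive variant of the same idea is sketched below. It assumes file1 is passed first, keys on NR==FNR rather than on the field count, and skips rows where $3+$4 is zero. The function name penalize and the temporary sample files are illustrative only, not part of the answer above.

```shell
#!/bin/sh
# Sketch: subtract S = 100 * $4 / ($3 + $4), taken from file1, from $2 of file2.
# penalize is a hypothetical helper name; its arguments are file1 and file2.
penalize() {
    awk '
    NR == FNR {                      # first file: build the penalty table
        tl = $3 + $4                 # TL in the question
        if (tl > 0) S[$1] = 100 * $4 / tl
        next
    }
    {                                # second file: apply the penalty, if any
        printf "%s\t%.2f\n", $1, $2 - S[$1]   # an unset S[$1] counts as 0
    }' "$1" "$2"
}

# Sample data copied from the answer above.
tmp=$(mktemp -d)
printf 'ACP5 4 1058 0\nACTB 10 1708 79\nORAI1 2 952 0\nTBX1 9 1932 300\n' > "$tmp/file1"
printf 'ACP5\t100.00\nACTB\t100.00\nORAI1\t94.01\nTBX1\t77.23\n' > "$tmp/file2"
penalize "$tmp/file1" "$tmp/file2"
```

Ids that appear only in file2 pass through with their score unchanged, because an unset array element evaluates to 0 in a numeric context.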

Related

Modify tab delimited txt file

I want to modify tab delimited txt file using linux commands sed/awk/or any other method
This is an example of tab delimited txt file which I want to modify for R boxplot input:
----start of input format---------
chr8 38277027 38277127 Ex8_inner
25425 8 100 0.0800000
chr8 38277027 38277127 Ex8_inner
25426 4 100 0.0400000
chr9 38277027 38277127 Ex9_inner
25427 9 100 0.0900000
chr9 38277027 38277127 Ex9_inner
25428 1 100 0.0100000
chr10 38277027 38277127 Ex10_inner
30935 1 100 0.0100000
chr10 38277027 38277127 Ex10_inner
31584 1 100 0.0100000
all 687 1 1000 0.0010000
all 694 1 1000 0.0010000
all 695 1 1000 0.0010000
all 697 1 1000 0.0010000
all 699 6 1000 0.0060000
all 700 2 1000 0.0020000
all 723 7 1000 0.0070000
all 740 8 1000 0.0080000
all 742 1 1000 0.0010000
all 761 5 1000 0.0050000
all 814 2 1000 0.0020000
all 821 48 1000 0.0480000
------end of input file format------
I want it to be modified so that the 4th column of the odd rows becomes the 1st column and the 2nd column of the even rows (where the 1st column is blank) becomes the 2nd column. Rows starting with "all" get deleted.
This is how output file should look:
-----start of the output file----
Ex8_inner 25425
Ex8_inner 25426
Ex9_inner 25427
Ex9_inner 25428
Ex10_inner 30935
Ex10_inner 31584
-----end of the output file----
EDIT: Since the OP has changed the Input_file sample a bit, adding code for it too.
awk --re-interval 'match($0,/Exon[0-9]{1,}/){val=substr($0,RSTART,RLENGTH);getline;sub(/^ +/,"",$1);print val,$1}' Input_file
NOTE: My awk is an old version, so I added --re-interval; you do not need it if you have a recent version of awk.
A single awk like the following may also help.
awk '/Ex[0-9]+_inner/{val=$NF;getline;sub(/^ +/,"",$1);print val,$1}' Input_file
Explanation: Adding an explanation of the above code here.
awk '
/Ex[0-9]+_inner/{ ##Check whether the line contains the string Ex, then digits, then _inner; if so, do the following.
val=$NF; ##Store the last field of the current line in the variable val.
getline; ##getline is a built-in awk statement that reads the next line into $0.
sub(/^ +/,"",$1); ##Use the awk sub function to strip leading spaces from the first field.
print val,$1 ##Print val and the first field of the line just read.
}
' Input_file ##The input file name.
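One thing to watch with getline: it returns 1 on success, 0 at end of input and -1 on error, so checking the return value avoids printing a stale line when a label row happens to be the last line of the file. A minimal sketch of that guard follows. It assumes the value rows start with a tab (blank first column), as described in the question; the function name pair_rows and the shortened sample file are illustrative only.

```shell
#!/bin/sh
# pair_rows is a hypothetical helper; its argument is the input file.
pair_rows() {
    awk '
    /^all/ { next }                       # drop summary rows
    /^chr/ {
        label = $NF                       # last field, e.g. Ex8_inner
        if ((getline) > 0)                # advance to the value row, if there is one
            print label, $1               # default FS skips the leading tab
    }' "$1"
}

# Shortened sample; the final chr10 row deliberately has no value row after it.
tmp=$(mktemp -d)
printf 'chr8\t38277027\t38277127\tEx8_inner\n\t25425\t8\t100\t0.0800000\n' > "$tmp/input.txt"
printf 'chr9\t38277027\t38277127\tEx9_inner\n\t25427\t9\t100\t0.0900000\n' >> "$tmp/input.txt"
printf 'all\t687\t1\t1000\t0.0010000\n' >> "$tmp/input.txt"
printf 'chr10\t38277027\t38277127\tEx10_inner\n' >> "$tmp/input.txt"
pair_rows "$tmp/input.txt"
```

With this input only the Ex8_inner and Ex9_inner pairs are printed; the dangling chr10 row produces no output.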
another awk
$ awk '/^all/{next}
!/^chr/{printf "%s\n", $1; next}
{printf "%s ", $NF}' file
Ex8_inner 25425
Ex8_inner 25426
Ex9_inner 25427
Ex9_inner 25428
Ex10_inner 30935
Ex10_inner 31584
or perhaps
$ awk '!/^all/{if(/^chr/) printf "%s", $NF OFS; else print $1}' file

awk to add missing to sequential order if id not found in file

The awk below looks for the ids from file1 in $2 of file2 and, if they match, prints the matching line. If an id is missing from file2 (like ARRR and AAAA), I cannot figure out how to add it to the output as missing, following the same format: the next sequential number in $1, the id from file1 in $2, and the word missing in $3. Thank you :).
awk
awk -F'\t' 'NR==FNR{A[$1];next}$2 in A' file1 file2
file1 space delimited
AARS
AARS2
AARS2;TMEM151B
ARRR
AAAS
AAAA
AADAC
file2 tab-delimited
1 AARS 100.00
2 AARS2 100.00
3 AARS2;TMEM151B 100.00
4 AAAS 100.00
5 AADAC 100.00
desired output tab-delimited
1 AARS 100.00
2 AARS2 100.00
3 AARS2;TMEM151B 100.00
4 AAAS 100.00
5 AADAC 100.00
6 ARRR missing
7 AAAA missing
awk solution:
awk 'NR==FNR{ a[$0]; next }$2 in a{ delete a[$2] }
END{ for(i in a) print ++FNR,i,"missing" }1' file1 OFS='\t' file2
The output:
1 AARS 100.00
2 AARS2 100.00
3 AARS2;TMEM151B 100.00
4 AAAS 100.00
5 AADAC 100.00
6 AAAA missing
7 ARRR missing
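Note that for (i in a) visits the keys in an unspecified order, which is why AAAA comes before ARRR here while the desired output lists ARRR first. If the order from file1 matters, one option is to record each id's position as it is read. A minimal sketch, assuming file1 has one id per line and that $1 of file2 holds the running number; the array names order and seen and the function name add_missing are illustrative only:

```shell
#!/bin/sh
# add_missing is a hypothetical helper; its arguments are file1 and file2.
add_missing() {
    awk -v OFS='\t' '
    NR == FNR { order[++n] = $1; next }     # file1: remember ids in input order
    { seen[$2]; last = $1; print }          # file2: note the id, pass the line through
    END {
        for (i = 1; i <= n; i++)            # walk file1 ids in their original order
            if (!(order[i] in seen))
                print ++last, order[i], "missing"
    }' "$1" "$2"
}

# Sample data copied from the question.
tmp=$(mktemp -d)
printf 'AARS\nAARS2\nAARS2;TMEM151B\nARRR\nAAAS\nAAAA\nAADAC\n' > "$tmp/file1"
printf '1\tAARS\t100.00\n2\tAARS2\t100.00\n3\tAARS2;TMEM151B\t100.00\n4\tAAAS\t100.00\n5\tAADAC\t100.00\n' > "$tmp/file2"
add_missing "$tmp/file1" "$tmp/file2"
```

Here the numbering continues from the last $1 seen in file2 rather than from FNR, so it stays correct even if file2 does not start at 1.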

extracting data from a column based on another column

I have some files as shown below. I would like to extract the values of $5 based on $1.
file1
sam 60.2 143 40.4 19.8
mathew 107.9 144 35.6 72.3
baby 48.1 145 17.8 30.3
rehna 47.2 146 21.2 26.0
sam 69.9 147 .0 69.9
file2
baby 58.9 503 47.5 11.4
daisy 20.8 504 20.4 .4
arch 61.1 505 12.3 48.8
sam 106.6 506 101.6 5.0
rehna 73.5 507 35.9 37.6
sam 92.0 508 61.1 30.9
I used the following code to extract $5.
awk '$1 == "rehna" { print $5 }' *
awk '$1 == "sam" { print $5 }' *
I would like to get the output as shown below
rehna sam
26.0 19.8
37.6 69.9
5.0
30.9
How do I achieve this? Your suggestions would be appreciated!
The simplest is probably to paste the results together:
#!/bin/bash
function myawk {
awk -v name="$1" 'BEGIN {print name} $1 == name { print $5 }' file1 file2
}
paste <(myawk rehna) <(myawk sam)
Running this produces the results you requested (with TAB as the separator character). See paste documentation for other options.
Update: peak's answer has since wrapped this approach in a function, in the spirit of DRY. If you want more background information, read on.
Assuming Bash, Ksh, or Zsh as the shell:
printf '%s\t%s\n' 'rehna' 'sam'
paste \
<(awk '$1 == "rehna" { print $5 }' *) \
<(awk '$1 == "sam" { print $5 }' *)
The above produces tab-separated output.
paste is a POSIX utility that outputs corresponding lines from its input files, by default separated with tabs; e.g., paste fileA fileB yields:
<line 1 from fileA>\t<line 1 from fileB>
<line 2 from fileA>\t<line 2 from fileB>
...
If any input file runs out of lines, it supplies empty lines.
In the case at hand, the respective outputs from the awk commands are used as input files, using process substitution (<(...)).
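Process substitution is not available in plain POSIX sh. A portable sketch of the same idea, assuming the sample file1 and file2 from the question, writes each column to a temporary file first and then pastes the files. The helper name column_for and the temporary file names are illustrative only.

```shell
#!/bin/sh
# Recreate the sample files from the question in a temporary directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
sam 60.2 143 40.4 19.8
mathew 107.9 144 35.6 72.3
baby 48.1 145 17.8 30.3
rehna 47.2 146 21.2 26.0
sam 69.9 147 .0 69.9
EOF
cat > "$tmp/file2" <<'EOF'
baby 58.9 503 47.5 11.4
daisy 20.8 504 20.4 .4
arch 61.1 505 12.3 48.8
sam 106.6 506 101.6 5.0
rehna 73.5 507 35.9 37.6
sam 92.0 508 61.1 30.9
EOF

# column_for NAME: print NAME as a header, then $5 of every row whose $1 is NAME.
column_for() {
    awk -v name="$1" 'BEGIN { print name } $1 == name { print $5 }' "$tmp/file1" "$tmp/file2"
}

column_for rehna > "$tmp/rehna.col"
column_for sam > "$tmp/sam.col"
paste "$tmp/rehna.col" "$tmp/sam.col"    # shorter column is padded with empty fields
```

The last two output lines start with a tab, because the rehna column has run out of values by then.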

Comparing two lists and printing select columns from each list

I want to compare two lists and print some columns from one, and some from the other if two match. I suspect I'm close but I suppose it's better to check..
1st file: Data.txt
101 0.123
145 0.119
242 0.4
500 0.88
2nd File: Map.txt
red 1 99
blue 3 101
rob 3 240
ted 7 500
So, if I want to compare the 3rd column of file2 against the 1st of file1 and print the 1st column of file2 and all of file1, I tried awk 'NR==FNR {a[$3];next}$1 in a{print$0}' file2 file1
but that only prints matches in file1. I tried adding x=$1 in the awk, i.e. awk 'NR==FNR {x=$1;a[$3];next}$1 in a{print x $0}' file2 file1, but that saves only one value of $1 and outputs that value on every line. I also tried adding $1 into a[$3], which is obviously wrong and gives zero output.
Ideally I'd like to get this output:
blue 145 0.119
ted 500 0.88
which is the 1st column of file2 and the 3rd column of file2 matched to 1st column of file1, and the rest of file1.
You had it almost exactly in your second attempt. Just instead of assigning the value of $1 to a scalar you can stash it in the array for later use.
awk 'NR==FNR {a[$3]=$1; next} $1 in a {print a[$1], $0}' file2.txt file1.txt
$ cat file1.txt
101 0.123
145 0.119
242 0.4
500 0.88
$ cat file2.txt
red 1 99
blue 3 101
rob 3 240
ted 7 500
$ awk 'NR==FNR {a[$3]=$1; next} $1 in a {print a[$1], $0}' file2.txt file1.txt
blue 101 0.123
ted 500 0.88

Compare two files and append the values, leave the mismatches as such in the output file

I'm trying to match two files, file1.txt (50,000 lines) and file2.txt (55,000 lines). I want to compare file2 to file1, extract the values of columns 2 and 3 for matching ids, and leave the mismatches as they are. The output file must contain all the ids from file2, i.e., it should have 55,000 lines. Note: not all the ids in file1 are present in file2, i.e. the actual matches could be fewer than 50,000.
file1.txt
ab1 12 345
ab2 9 456
gh67 6 987
file2.txt
ab2 0 0
ab1 0 345
nh7 0 0
gh67 6 987
Output
ab2 9 456
ab1 12 345
nh7 0 0
gh67 6 987
This is what I tried, but it only prints the matches (so instead of 55,000 lines I have 49,000 lines in my output file)
awk "NR==FNR {f[$1]=$0;next}$1 in f{print f[$1],$0}" file1.txt file2.txt >output.txt
This awk script will work
NR == FNR {
a[$1] = $0
next
}
$1 in a {
split(a[$1], b)
print $1, (b[2] == $2 ? $2 : b[2]), (b[3] == $3 ? $3 : b[3])
}
!($1 in a)
If you save this as a.awk and run
awk -f a.awk foo.txt foo1.txt
This will output
ab2 9 456
ab1 12 345
nh7 0 0
gh67 6 987
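Because a matching line from the first file simply replaces the line from the second, the same result can be sketched more compactly. This assumes, as in the sample, that both files have the same three columns and that the first file is passed first. The function name merge_ids is illustrative only.

```shell
#!/bin/sh
# merge_ids is a hypothetical helper; its arguments are file1.txt and file2.txt.
merge_ids() {
    awk '
    NR == FNR { a[$1] = $0; next }          # first file: store the whole line by id
    { print (($1 in a) ? a[$1] : $0) }      # second file: stored line if matched, else as-is
    ' "$1" "$2"
}

# Sample data copied from the question.
tmp=$(mktemp -d)
printf 'ab1 12 345\nab2 9 456\ngh67 6 987\n' > "$tmp/file1.txt"
printf 'ab2 0 0\nab1 0 345\nnh7 0 0\ngh67 6 987\n' > "$tmp/file2.txt"
merge_ids "$tmp/file1.txt" "$tmp/file2.txt"
```

Every line of the second file produces exactly one output line, so the output always has as many lines as file2.txt.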