Comparing two lists and printing select columns from each list - awk

I want to compare two lists and print some columns from one, and some from the other, when two entries match. I suspect I'm close, but I suppose it's better to check.
1st file: Data.txt
101 0.123
145 0.119
242 0.4
500 0.88
2nd File: Map.txt
red 1 99
blue 3 101
rob 3 240
ted 7 500
So, if I want to compare the 3rd column of file2 against the 1st column of file1 and print the 1st column of file2 plus all of file1, I tried awk 'NR==FNR {a[$3];next}$1 in a{print$0}' file2 file1
but that only prints the matching lines of file1. I tried adding x=$1 inside the awk, i.e. awk 'NR==FNR {x=$1;a[$3];next}$1 in a{print x $0}' file2 file1, but that saves only one value of $1 and prints that same value on every line. I also tried adding $1 into a[$3], which is obviously wrong and gives no output at all.
Ideally I'd like to get this output:
blue 101 0.123
ted 500 0.88
which is the 1st column of file2, then the 3rd column of file2 (matched to the 1st column of file1), and the rest of that file1 line.

You had it almost exactly in your second attempt. Instead of assigning the value of $1 to a scalar, stash it in the array, keyed by $3, so you can look it up later.
awk 'NR==FNR {a[$3]=$1; next} $1 in a {print a[$1], $0}' file2.txt file1.txt
$ cat file1.txt
101 0.123
145 0.119
242 0.4
500 0.88
$ cat file2.txt
red 1 99
blue 3 101
rob 3 240
ted 7 500
$ awk 'NR==FNR {a[$3]=$1; next} $1 in a {print a[$1], $0}' file2.txt file1.txt
blue 101 0.123
ted 500 0.88
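For readability, the same command can be written out in expanded form; this is just a sketch equivalent to the one-liner above, nothing different:
awk '
NR==FNR {               # first file (Map.txt): remember $1 keyed by $3
    a[$3] = $1
    next
}
$1 in a {               # second file (Data.txt): its $1 matches a stored key
    print a[$1], $0     # name from file2, then the whole file1 line
}' file2.txt file1.txt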

Related

Query the contents of a file using another file in AWK

I am trying to conditionally filter a file based on values in a second file. File1 contains numbers and File2 contains two columns of numbers. The goal is to keep those rows of file1 whose value falls within the range given in any row of file2.
I have a series of nested loops that works, but it takes more than 12 hours to run depending on the lengths of both files; this code is noted below. Alternatively, I have tried to use awk, and looked at other questions posted on Stack Overflow, but I cannot figure out how to adapt the code appropriately.
Loop method:
while IFS= read READ
do
    position=$(echo $READ | awk '{print $4}')    # value to test, 4th column of the file1 record
    while IFS= read BED
    do
        St=$(echo $BED | awk '{print $2}')       # range start, 2nd column of file2
        En=$(echo $BED | awk '{print $3}')       # range end, 3rd column of file2
        if (($position < "$St"))
        then
            break
        else
            if (($position >= "$St" && $position <= "$En"));
            then
                echo "$READ" | awk '{print $0"\t EXON"}' >> outputfile
            fi
        fi
    done < file2
done < file1
Similar questions I looked at:
awk: filter a file with another file
awk 'NR==FNR{a[$1];next} !($2 in a)' d3_tmp FS="[ \t=]" m2p_tmp
Find content of one file from another file in UNIX
awk -v FS="[ =]" 'NR==FNR{rows[$1]++;next}(substr($NF,1,length($NF)-1) in rows)' File1 File2
file1: (tab delimited)
AAA BBB 1500
CCC DDD 2500
EEE FFF 2000
file2: (tab delimited)
GGG 1250 1750
HHH 1950 2300
III 2600 2700
The expected output would retain rows 1 and 3 from file1 (in a new file, file3) because those records fall within the ranges given in row 1, columns 2 and 3, and row 2, columns 2 and 3 of file2. The comparison is not row-restricted, i.e. I do not want to compare row 1 of file1 only against row 1 of file2, but against every row of file2 to find a hit.
file3 (output)
AAA BBB 1500
EEE FFF 2000
One way:
awk 'NR==FNR{a[i]=$2;b[i++]=$3;next}{for(j=0;j<i;j++){if ($3>=a[j] && $3<=b[j]){print;}}}' i=0 file2 file1
AAA BBB 1500
EEE FFF 2000
Read the file2 contents and store the range starts in array a and the range ends in array b. When file1 is read, check its number against every a[j]/b[j] pair and print the line if it falls inside one of the ranges.
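Expanded with comments, that one-liner reads like the sketch below (same logic, nothing new):
awk '
NR==FNR {               # first file (file2): store each range
    a[i] = $2           # range start
    b[i++] = $3         # range end
    next
}
{                       # second file (file1): test $3 against every stored range
    for (j = 0; j < i; j++)
        if ($3 >= a[j] && $3 <= b[j])
            print
}' file2 file1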
One more option:
$ awk 'NR==FNR{for(i=$2;i<=$3;i++)a[i];next}($3 in a)' file2 file1
AAA BBB 1500
EEE FFF 2000
File2 is read and every number in each range is stored as a key of the associative array a. When file1 is read, we only need to look up $3 in array a. (Note this creates one array entry per number in each range, so it assumes integer values and suits ranges that are not too wide.)
Another awk. It may or may not make sense depending on the filesizes:
$ awk '
NR==FNR {
a[$3]=$2 # hash file2 records, $3 is key, $2 value
next
}
{
for(i in a) # for each record in file1 go through every element in a
if($3<=i && $3>=a[i]) { # if $3 falls within the range a[i]..i
print # output
break # exit loop once match found
}
}' file2 file1
Output:
AAA BBB 1500
EEE FFF 2000

awk to compare multiple columns in 2 files

I would like to compare multiple columns from 2 files and NOT print lines matching my criteria.
An example of this would be:
file1
apple green 4
orange red 5
apple yellow 6
apple yellow 8
grape green 5
file2
apple yellow 7
grape green 10
output
apple green 4
orange red 5
apple yellow 8
I want to remove lines where $1 and $2 from file1 correspond to $1 and $2 from file2 AND when $3 from file1 is smaller than $3 from file2.
So far I can only do the first part of the job, that is, remove lines where $1 and $2 from file1 correspond to $1 and $2 from file2 (fields are separated by tabs):
awk -F '\t' 'FNR == NR {a[$1FS$2]=$1; next} !($1FS$2 in a)' file2 file1
Could you help me apply the last condition?
Many thanks in advance!
What you are after is this:
awk '(NR==FNR){a[$1,$2]=$3; next} !((($1,$2) in a) && $3 < a[$1,$2])' file2 file1
Store 3rd field value while building the array and then use it for comparison
$ awk -F '\t' 'FNR==NR{a[$1FS$2]=$3; next} !(($1FS$2 in a) && $3 < a[$1FS$2])' f2 f1
apple green 4
orange red 5
apple yellow 8
Better written as:
awk -F '\t' '{k = $1FS$2} FNR==NR{a[k]=$3; next} !((k in a) && $3 < a[k])' f2 f1
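Spelled out as an explicit block, the same filter reads like the sketch below (equivalent to the one-liners above, assuming the tab-separated input described in the question):
awk -F '\t' '
FNR==NR {                     # first file (file2): remember its $3 for each $1,$2 pair
    a[$1 FS $2] = $3
    next
}
{                             # second file (file1)
    k = $1 FS $2
    if (k in a && $3 < a[k])  # pair exists in file2 and the file1 value is smaller
        next                  # drop the line
    print                     # otherwise keep it
}' f2 f1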

Modify tab delimited txt file

I want to modify tab delimited txt file using linux commands sed/awk/or any other method
This is an example of tab delimited txt file which I want to modify for R boxplot input:
----start of input format---------
chr8 38277027 38277127 Ex8_inner
25425 8 100 0.0800000
chr8 38277027 38277127 Ex8_inner
25426 4 100 0.0400000
chr9 38277027 38277127 Ex9_inner
25427 9 100 0.0900000
chr9 38277027 38277127 Ex9_inner
25428 1 100 0.0100000
chr10 38277027 38277127 Ex10_inner
30935 1 100 0.0100000
chr10 38277027 38277127 Ex10_inner
31584 1 100 0.0100000
all 687 1 1000 0.0010000
all 694 1 1000 0.0010000
all 695 1 1000 0.0010000
all 697 1 1000 0.0010000
all 699 6 1000 0.0060000
all 700 2 1000 0.0020000
all 723 7 1000 0.0070000
all 740 8 1000 0.0080000
all 742 1 1000 0.0010000
all 761 5 1000 0.0050000
all 814 2 1000 0.0020000
all 821 48 1000 0.0480000
------end of input file format------
I want it to be modified so that the 4th column of the odd rows becomes the 1st output column, and the 2nd column of the even rows (their 1st column is blank) becomes the 2nd output column. Rows starting with "all" get deleted.
This is how output file should look:
-----start of the output file----
Ex8_inner 25425
Ex8_inner 25426
Ex9_inner 25427
Ex9_inner 25428
Ex10_inner 30935
Ex10_inner 31584
-----end of the output file----
EDIT: Since the OP has changed the Input_file sample a bit, adding code for it too.
awk --re-interval 'match($0,/Exon[0-9]{1,}/){val=substr($0,RSTART,RLENGTH);getline;sub(/^ +/,"",$1);print val,$1}' Input_file
NOTE: My awk is an old version, so I added --re-interval; you need not add it if you have a recent version.
With a single awk, the following may help you too.
awk '/Ex[0-9]+_inner/{val=$NF;getline;sub(/^ +/,"",$1);print val,$1}' Input_file
Explanation:
awk '
/Ex[0-9]+_inner/{ ##Check whether the line contains the string Ex, then digits, then _inner; if so, do the following actions.
val=$NF; ##Create a variable named val whose value is $NF (the last field of the current line).
getline; ##Use getline, a built-in awk statement, to move the cursor to the next line.
sub(/^ +/,"",$1); ##Use awk's sub function to strip the leading spaces from the first field.
print val,$1 ##Print the variable val and the first field.
}
' Input_file ##Mentioning the Input_file name here.
another awk
$ awk '/^all/{next}
!/^chr/{printf "%s\n", $1; next}
{printf "%s ", $NF}' file
Ex8_inner 25425
Ex8_inner 25426
Ex9_inner 25427
Ex9_inner 25428
Ex10_inner 30935
Ex10_inner 31584
or perhaps
$ awk '!/^all/{if(/^chr/) printf "%s", $NF OFS; else print $1}' file
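Expanded with comments, the compact version amounts to the sketch below (same logic as the one-liner):
awk '
!/^all/ {                         # ignore the "all ..." summary rows entirely
    if (/^chr/)
        printf "%s%s", $NF, OFS   # exon name from the chr line, no newline yet
    else
        print $1                  # first value of the following line completes the row
}' file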

awk to update file with sum of matching another file

In the awk below I am trying to apply a penalty to the score of each matching $1 in file2, based on the sum $3+$4 (variable TL) in file1. The $4 value in file1 is then divided by TL and multiplied by 100 (this value is variable S). Finally, $2 in file2 minus S gives the updated $2 result in file2. Since math is not my strong suit there probably is a better way of doing this, but this is what I could think of. Thank you :).
file1 space delimited
ACP5 4 1058 0
ACTB5 10 1708 79
ORAI1 2 952 0
TBX1 9 1932 300
file2 tab-delimited
ACP5 100.00
ACTB 100.00
ORAI1 94.01
TBX1 77.23
desired output, tab-delimited (the ---- part is an example calculation and not part of the output)
ACP5 100.00
ACTB 89.59 ---- $3+$4=1787 this is TL (comes from file1), $4/TL*100 is 4.42, $2 in file2 is 100 - 4.42 = 95.58 ----
ORAI1 94.01
TBX1 63.79
awk
awk '
FNR==NR{ # process each line
TL[$1]=($3+$4);next} ($1 in TL) # from file1 store sum of $3 and $4 in TL
{S=(P[$4]/TL)*100;printf("%s\t %.2f\n",$1, $2-S) # store $4/TL from file1 in S and subtract S from $2 in file2, output two decimal places
}1' OFS="\t" file1 FS="\t" file2 # update and define input
current output
ACP5 100.00
ACTB 100.00
ORAI1 94.01
TBX1 77.23
As pointed out in the comments, the question is not completely clear. Since I can't comment yet I will give a solution that calculates the values as requested.
awk '
NF==4 { S[$1] = 100 * $4 / ($3 + $4) }
NF==2 { printf("%s\t%.2f\n", $1, $2 - S[$1]) }
' file1 file2
file1
ACP5 4 1058 0
ACTB 10 1708 79
ORAI1 2 952 0
TBX1 9 1932 300
file2
ACP5 100.00
ACTB 100.00
ORAI1 94.01
TBX1 77.23
output
ACP5 100.00
ACTB 95.58
ORAI1 94.01
TBX1 63.79
Explanation:
The script works by calculating and storing the S value in an associative array using $1 as the key. This is done in a block guarded by NF==4, so it only runs for the first file (the only one with 4 fields). Then, for NF==2, i.e. the second file, the result is printed with printf, subtracting the corresponding S value from $2.
Observation: Keep in mind that, as @kvantour pointed out, the example you provided does not follow the indications in the question. For example, where does the 89.59 value come from? The explanation ends up with 95.58 as the result, just like the output of the script I provided.
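As a worked check of that figure: for ACTB, TL = 1708 + 79 = 1787, S = 100 * 79 / 1787 ≈ 4.42, and 100.00 - 4.42 = 95.58, matching the output above.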

Compare two files and append the values, leave the mismatches as such in the output file

I'm trying to match two files, file1.txt (50,000 lines) and file2.txt (55,000 lines). I want to compare file2 to file1, pull in the values of columns 2 and 3 from file1, and leave the mismatches as they are. The output file must contain all the ids from file2, i.e. it should have 55,000 lines. Note: not all the ids in file1 are present in file2, i.e. the actual number of matches could be less than 50,000.
file1.txt
ab1 12 345
ab2 9 456
gh67 6 987
file2.txt
ab2 0 0
ab1 0 345
nh7 0 0
gh67 6 987
Output
ab2 9 456
ab1 12 345
nh7 0 0
gh67 6 987
This is what I tried, but it only prints the matches (so instead of 55,000 lines I have 49,000 lines in my output file):
awk 'NR==FNR {f[$1]=$0;next}$1 in f{print f[$1],$0}' file1.txt file2.txt >output.txt
This awk script will work
NR == FNR {
a[$1] = $0
next
}
$1 in a {
split(a[$1], b)
print $1, (b[2] == $2 ? $2 : b[2]), (b[3] == $3 ? $3 : b[3])
}
!($1 in a)
If you save this as a.awk and run
awk -f a.awk file1.txt file2.txt
This will output
ab2 9 456
ab1 12 345
nh7 0 0
gh67 6 987
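A brief note on how it works: the first block hashes every file1 line by its id; for file2 lines whose id is in the hash, the stored file1 fields replace the zeros, and the bare pattern !($1 in a) passes through file2 lines whose id never appears in file1, so all of file2's ids are kept. Assuming the columns are exactly as shown, the same effect can be sketched as a one-liner:
awk 'NR==FNR{a[$1]=$0; next} $1 in a{print a[$1]; next} {print}' file1.txt file2.txt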