awk to filter lines in a file based on a match and a condition in another file

I have a file with this format:
file1
id1 12.4
id2 21.6
id4 17.3
id6 95.5
id7 328.6
And I want to filter it based on another file with the format:
file2
id1 11.5
id2 10.4
id3 58.4
id4 24.6
id5 234.4
id6 2.5
id7 330.6
First, I would like to match ids between files. Then, I want to keep the lines in file1 in which the score (second column) is greater than the score for the same id in file2. It would output this:
id1 12.4
id2 21.6
id6 95.5
I started writing the code like awk 'FNR==NR { a[$1][$2][$0]; next } $1 in a {}' file1 file2 which I think would match the ids between files, but I don't know how to complete the code to filter by the scores.

You could write the awk command so that it reads file2 first, keeping track of the values with a[$1] = $2+0 (adding zero forces a numeric comparison).
Then you can do the comparison while reading file1:
awk 'FNR==NR { a[$1] = $2+0; next } $1 in a && $2+0 > a[$1]' file2 file1
Output
id1 12.4
id2 21.6
id6 95.5
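Spelled out with comments, the same command looks like this (identical logic, just expanded for readability):
awk '
FNR == NR {                  # true only while reading the first argument, file2
    a[$1] = $2 + 0           # remember the score per id; +0 forces a numeric value
    next                     # do not run the filter on file2 lines
}
$1 in a && $2 + 0 > a[$1]    # file1: print lines whose score beats the stored one
' file2 file1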

Related

Print consecutive lines conditional on two fields and subtract another field

I would like to print consecutive lines if they have a matching first field but opposite signs in the third field, then compute the distance between the second fields of those consecutive lines.
Input:
id1 pos1 0.19
id1 pos2 0.33
id1 pos3 -0.25
id1 pos4 -0.22
id2 pos5 0.33
id3 pos6 -0.21
id3 pos7 -0.56
id3 pos8 -0.20
id3 pos9 0.33
id3 pos10 -0.32
Intermediate output:
id1 pos2 0.33
id1 pos3 -0.25
id3 pos8 -0.20
id3 pos9 0.33
id3 pos10 -0.32
Desired output:
id1 pos3-pos2
id3 pos9-pos8
id3 pos10-pos9
I found similar questions comparing consecutive lines but none can be applied to answer my question.
So far I tried:
awk '$1==prev1{$NF=$2-prev2;print $1,$NF} {prev2=$2;prev1=$1}'
But I do not know how to add the condition that the third field must have the opposite sign.
Could you please try the following.
awk '
prev!=$1{                       # new first field: reset the stored values
  prev_val=prev=""
}
prev==$1{                       # same first field as the previous line
  if(($NF~/^-/ && prev_val!~/^-/) || ($NF!~/^-/ && prev_val~/^-/)){
    print $1,$2,$NF-prev_val    # signs differ: print id, position and difference
  }
}
{
  prev=$1                       # remember the first and last fields for the next line
  prev_val=$NF
}
' Input_file
From your description, this awk should do it:
awk '{sc=$3~/^-/?0:1} $1==p1&&sp!=sc {print $1,($3-p3)} {sp=sc;p1=$1;p3=$3}' file
id1 -0.58
id3 0.53
id3 -0.65
sc=$3~/^-/?0:1 sets sc to 1 if the value is positive and 0 if it is negative.
$1==p1&&sp!=sc tests whether the current ID equals the previous ID and the value changed sign;
print $1,($3-p3) prints the ID and the difference between the current and previous value.
sp=sc;p1=$1;p3=$3 sets the previous values: sp to sc, p1 to $1 and p3 to $3.
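For reference, the same one-liner expanded with comments (identical logic):
awk '
{ sc = ($3 ~ /^-/) ? 0 : 1 }                   # sign flag for the current value: 1 = positive, 0 = negative
$1 == p1 && sp != sc { print $1, ($3 - p3) }   # same ID and the sign flipped: print the difference
{ sp = sc; p1 = $1; p3 = $3 }                  # remember sign, ID and value for the next line
' file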
awk 'prev1!=$1{               # new first field: reset the stored fields
  prev3=prev2=prev1=""
}
prev1==$1{                    # same first field as the previous line
  if(($3~/^-/ && prev3!~/^-/) || ($3!~/^-/ && prev3~/^-/)){
    print $1,$2-prev2         # signs differ: print id and the difference of the second fields
  }
}
{
  prev1=$1                    # remember the fields for the next line
  prev2=$2
  prev3=$3
}
' Input
This is the answer to my question. Thanks to all for helping me.

Query the contents of a file using another file in AWK

I am trying to conditionally filter a file based on values in a second file. File1 contains numbers and File2 contains two columns of numbers that define ranges. The task is to extract those rows in file1 which fall within a range denoted in any row of file2.
I have a series of loops which works, but takes >12 hours to run depending on the lengths of both files; this code is shown below. Alternatively, I have tried to use awk, and looked at other questions posted on Stack Overflow, but I cannot figure out how to adapt the code appropriately.
Loop method:
while IFS= read READ
do
  position=$(echo $READ | awk '{print $4}')
  while IFS= read BED
  do
    St=$(echo $BED | awk '{print $2}')
    En=$(echo $BED | awk '{print $3}')
    if (($position < "$St"))
    then
      break
    else
      if (($position >= "$St" && $position <= "$En"));
      then
        echo "$READ" | awk '{print $0"\t EXON"}' >> outputfile
      fi
    fi
  done < file2
done < file1
Similar questions I looked at:
awk: filter a file with another file
awk 'NR==FNR{a[$1];next} !($2 in a)' d3_tmp FS="[ \t=]" m2p_tmp
Find content of one file from another file in UNIX
awk -v FS="[ =]" 'NR==FNR{rows[$1]++;next}(substr($NF,1,length($NF)-1) in rows)' File1 File2
file1: (tab delimited)
AAA BBB 1500
CCC DDD 2500
EEE FFF 2000
file2: (tab delimited)
GGG 1250 1750
HHH 1950 2300
III 2600 2700
The expected output would retain rows 1 and 3 from file1 (in a new file, file3), because these records fall within the ranges given by columns 2 and 3 of rows 1 and 2 of file2. In the actual files the match is not row-restricted, i.e. I do not want to compare row 1 of file1 only against row 1 of file2, but against all rows of file2 to find a hit.
file3 (output)
AAA BBB 1500
EEE FFF 2000
One way:
awk 'NR==FNR{a[i]=$2;b[i++]=$3;next}{for(j=0;j<i;j++){if ($3>=a[j] && $3<=b[j]){print;}}}' i=0 file2 file1
AAA BBB 1500
EEE FFF 2000
Read the contents of file2 and store the bounds in arrays a and b. When file1 is read, check whether the number lies between any a/b pair and print the line.
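The same one-liner written out with comments (same behaviour; note it prints a line once per matching range, so overlapping ranges in file2 could print a row more than once):
awk '
NR == FNR {                     # first argument: file2
    a[i] = $2                   # lower bound of the range
    b[i++] = $3                 # upper bound of the range
    next
}
{                               # second argument: file1
    for (j = 0; j < i; j++)     # test the third field against every stored range
        if ($3 >= a[j] && $3 <= b[j]) print
}' i=0 file2 file1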
One more option:
$ awk 'NR==FNR{for(i=$2;i<=$3;i++)a[i];next}($3 in a)' file2 file1
AAA BBB 1500
EEE FFF 2000
File2 is read and the entire range of numbers is broken up and stored as keys of the associative array a. When we read file1, we just need to look up $3 in a.
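The same idea with comments (note this trick only works when the values to look up are integers, and memory grows with the width of the ranges):
awk '
NR == FNR {                            # first argument: file2
    for (i = $2; i <= $3; i++) a[i]    # create an array key for every integer in the range
    next
}
($3 in a)                              # second argument: file1, print if $3 is a stored key
' file2 file1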
Another awk; it may or may not make sense depending on the file sizes:
$ awk '
NR==FNR {
  a[$3]=$2                   # hash file2 records: $3 is the key, $2 the value
  next
}
{
  for(i in a)                # for each record in file1 go through every element in a
    if($3<=i && $3>=a[i]) {  # if it falls between the bounds
      print                  # output the line
      break                  # exit the loop once a match is found
    }
}' file2 file1
Output:
AAA BBB 1500
EEE FFF 2000

Extract rows in file where a column value is included in a list?

I have a huge file of data:
datatable.txt
id1 england male
id2 germany female
... ... ...
I have another list of data:
indexes.txt
id1
id3
id6
id10
id11
I want to extract all rows from datatable.txt where the id is included in indexes.txt.
Is it possible to do this with awk/sed/grep? The files are so large that using R or Python is not convenient.
You just need a simple awk like this:
awk 'FNR==NR {a[$1]; next}; $1 in a' indexes.txt datatable.txt
id1 england male
FNR==NR {a[$1]; next} processes indexes.txt, storing the contents of the first column as array keys until the end of that file.
Then, on datatable.txt, $1 in a matches the rows from the first file: it selects all rows of the current file whose first-column value was stored as a key.
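Since the question also mentions grep: as an alternative (assuming the ids cannot appear as whole words in columns other than the first, since grep matches anywhere on the line), a fixed-string, whole-word lookup does the same job:
grep -wFf indexes.txt datatable.txt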
Maybe I am overlooking something, but I built two test files:
a1:
id1
id2
id3
id6
id9
id10
and
a2:
id1 a 1
id2 b 2
id3 c 3
id4 c 4
id5 e 5
id6 f 6
id7 g 7
id8 h 8
id9 i 9
id10 j 10
with
join a1 a2 2> /dev/null
I get all lines matched by column one.
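Note that join expects both inputs to be sorted lexicographically on the join field; in the listings above id10 comes after id9 even though it sorts before id2, which is why the warnings are redirected to /dev/null. A sketch that sorts on the fly (bash process substitution):
join <(sort a1) <(sort a2)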

awk output first two columns then the minimum value out of the third and fourth columns

I have a tab delimited file like so:
col1 col2 col3 col4
a 5 y:3.2 z:5.1
b 7 r:4.1 t:2.2
c 8 e:9.1 u:3.2
d 10 o:5.2 w:1.1
For each row, I want to output the values in the first and second columns, and the smallest number out of the two values in the third and fourth columns.
col1 col2 min
a 5 3.2
b 7 2.2
c 8 3.2
d 10 1.1
My poor attempt:
awk -F'\t' '{min = ($3 < $4) ? $3 : $4; print $1, $2, min}'
One reason it's incorrect is that the values in the third and fourth columns aren't numbers but strings.
I don't know how to extract the number from the third and fourth columns; the number is always after the colon.
awk to the rescue!
$ awk -F'[ *:]' 'NR==1{print $1,$2,"min";next} {print $1,$2, $4<$6?$4:$6}' file
col1 col2 min
a 5 3.2
b 7 2.2
c 8 3.2
d 10 1.1
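One caveat: the question says the file is tab-delimited, and -F'[ *:]' splits on spaces, asterisks and colons, not on tabs. If the real file uses tabs, a sketch that keeps the tab separator and pulls the number out after the colon with split() (assuming the value always follows the colon, as stated):
awk -F'\t' '
NR == 1 { print $1, $2, "min"; next }      # header line
{
    split($3, a, ":"); split($4, b, ":")   # a[2] and b[2] hold the numbers after the colon
    print $1, $2, (a[2] + 0 < b[2] + 0 ? a[2] : b[2])
}' file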

Comparing two lists and printing select columns from each list

I want to compare two lists and print some columns from one, and some from the other, when two entries match. I suspect I'm close but I suppose it's better to check.
1st file: Data.txt
101 0.123
145 0.119
242 0.4
500 0.88
2nd File: Map.txt
red 1 99
blue 3 101
rob 3 240
ted 7 500
So, if I want to compare the 3rd column of file2 against the 1st of file1 and print the 1st column of file2 and all of file1, I tried awk 'NR==FNR {a[$3];next}$1 in a{print$0}' file2 file1
but that only prints the matching lines of file1. I tried adding x=$1 in the awk, i.e. awk 'NR==FNR {x=$1;a[$3];next}$1 in a{print x $0}' file2 file1, but that saves only one value of $1 and outputs that value on every line. I also tried adding $1 into a[$3], which is obviously wrong and thus gives zero output.
Ideally I'd like to get this output:
blue 145 0.119
ted 500 0.88
which is the 1st column of file2 and the 3rd column of file2 matched to 1st column of file1, and the rest of file1.
You had it almost exactly in your second attempt. Instead of assigning the value of $1 to a scalar, stash it in the array for later use:
awk 'NR==FNR {a[$3]=$1; next} $1 in a {print a[$1], $0}' file2.txt file1.txt
$ cat file1.txt
101 0.123
145 0.119
242 0.4
500 0.88
$ cat file2.txt
red 1 99
blue 3 101
rob 3 240
ted 7 500
$ awk 'NR==FNR {a[$3]=$1; next} $1 in a {print a[$1], $0}' file2.txt file1.txt
blue 101 0.123
ted 500 0.88
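The same one-liner written out with comments (identical logic):
awk '
NR == FNR {                 # first argument: file2.txt (Map.txt)
    a[$3] = $1              # key = 3rd column, value = the name in the 1st column
    next
}
$1 in a {                   # second argument: file1.txt (Data.txt)
    print a[$1], $0         # prepend the stored name to the matching row
}' file2.txt file1.txt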