How to compare the columns in two files using awk

I want to compare two files using awk. File 1 and File 2 contain the information shown below. File 1 lists nucleotide positions, the same kind of values that appear in column 2 of File 2.
I need an awk command that compares the single column of File 1 with column 2 of File 2 and, when a match is found, prints the whole line from File 2 to File 3.
File 1
7113528
8680847
File 2
chromosome01 6765006 . C T 22.8 . DP=3;RPB=-8.745357e-01;AF1=1;AC1=2;DP4=1,1,1,0;MQ=35;FQ=-27;PV4=1,1,1,1;EFF=DOWNSTREAM(MODIFIER||1505||492|OS01G0223600|protein_coding|CODING|OS01T0223600-01||1|WARNING_TRANSCRIPT_NO_START_CODON),DOWNSTREAM(MODIFIER||262||174|OS01G0223500|protein_coding|CODING|OS01T0223500-00||1),INTERGENIC(MODIFIER||||||||||1) PL 51,0,0
chromosome01 6765043 . C T 22.8 . DP=3;RPB=-8.745357e-01;AF1=1;AC1=2;DP4=1,1,1,0;MQ=35;FQ=-27;PV4=1,1,1,1;EFF=DOWNSTREAM(MODIFIER||1468||492|OS01G0223600|protein_coding|CODING|OS01T0223600-01||1|WARNING_TRANSCRIPT_NO_START_CODON),DOWNSTREAM(MODIFIER||225||174|OS01G0223500|protein_coding|CODING|OS01T0223500-00||1),INTERGENIC(MODIFIER||||||||||1) PL 51,0,0
chromosome01 7113528 . GACAC GAC 7.98 . INDEL;IS=1,0.333333;DP=3;VDB=6.186179e-02;AF1=1;AC1=2;DP4=1,1,0,1;MQ=35;FQ=-34.5;PV4=1,1,1,1;EFF=DOWNSTREAM(MODIFIER||2254||107|OS01G0228901|protein_coding|CODING|OS01T0228901-01||1),DOWNSTREAM(MODIFIER||3930|||NCRNA_20319|ncRNA|NON_CODING|NCRNA_20319||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER||3930|||NCRNA_21253|ncRNA|NON_CODING|NCRNA_21253||1),UPSTREAM(MODIFIER||4436||687|OS01G0228800|protein_coding|CODING|OS01T0228800-01||1) PL 43,0,0
chromosome01 7113583 . C T 22.8 . DP=3;RPB=-8.745357e-01;AF1=1;AC1=2;DP4=1,1,0,1;MQ=35;FQ=-27;PV4=1,1,1,1;EFF=DOWNSTREAM(MODIFIER||2202||107|OS01G0228901|protein_coding|CODING|OS01T0228901-01||1),DOWNSTREAM(MODIFIER||3982|||NCRNA_20319|ncRNA|NON_CODING|NCRNA_20319||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER||3982|||NCRNA_21253|ncRNA|NON_CODING|NCRNA_21253||1),UPSTREAM(MODIFIER||4488||687|OS01G0228800|protein_coding|CODING|OS01T0228800-01||1) PL 51,0,0
chromosome01 7427540 . C T 22.8 . DP=3;RPB=8.745357e-01;AF1=1;AC1=2;DP4=0,2,0,1;MQ=35;FQ=-27;PV4=1,1,1,1;EFF=DOWNSTREAM(MODIFIER||1091|||NCRNA_19787|ncRNA|NON_CODING|NCRNA_19787||1),DOWNSTREAM(MODIFIER||1113|||NCRNA_7056|ncRNA|NON_CODING|NCRNA_7056||1),DOWNSTREAM(MODIFIER||2841||256|OS01G0234433|protein_coding|CODING|OS01T0234433-00||1|WARNING_TRANSCRIPT_NO_START_CODON),DOWNSTREAM(MODIFIER||4859|||NCRNA_25306|ncRNA|NON_CODING|NCRNA_25306||1),NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Cca/Aca|P35T|421|OS01G0234200|protein_coding|CODING|OS01T0234200-00|1|1|WARNING_REF_DOES_NOT_MATCH_GENOME),UPSTREAM(MODIFIER||1091|||NCRNA_19719|ncRNA|NON_CODING|NCRNA_19719||1),UPSTREAM(MODIFIER||1113|||NCRNA_7253|ncRNA|NON_CODING|NCRNA_7253||1),UPSTREAM(MODIFIER||1844||386|OS01G0234300|protein_coding|CODING|OS01T0234300-00||1),UPSTREAM(MODIFIER||2862|||NCRNA_9648|ncRNA|NON_CODING|NCRNA_9648||1),UPSTREAM(MODIFIER||3028||255|OS01G0234499|protein_coding|CODING|OS01T0234499-00||1),UPSTREAM(MODIFIER||4863|||NCRNA_27966|ncRNA|NON_CODING|NCRNA_27966||1),UPSTREAM(MODIFIER||4872|||NCRNA_33984|ncRNA|NON_CODING|NCRNA_33984||1) PL 51,0,0
chromosome01 7427583 . T C 26.1 . DP=3;RPB=-9.668049e-01;AF1=1;AC1=2;DP4=0,1,0,1;MQ=42;FQ=-28;PV4=1,1,1,1;EFF=DOWNSTREAM(MODIFIER||1134|||NCRNA_19787|ncRNA|NON_CODING|NCRNA_19787||1),DOWNSTREAM(MODIFIER||1156|||NCRNA_7056|ncRNA|NON_CODING|NCRNA_7056||1),DOWNSTREAM(MODIFIER||2798||256|OS01G0234433|protein_coding|CODING|OS01T0234433-00||1|WARNING_TRANSCRIPT_NO_START_CODON),DOWNSTREAM(MODIFIER||4902|||NCRNA_25306|ncRNA|NON_CODING|NCRNA_25306||1),SYNONYMOUS_CODING(LOW|SILENT|ggC/ggG|G20|421|OS01G0234200|protein_coding|CODING|OS01T0234200-00|1|1|WARNING_REF_DOES_NOT_MATCH_GENOME),UPSTREAM(MODIFIER||1134|||NCRNA_19719|ncRNA|NON_CODING|NCRNA_19719||1),UPSTREAM(MODIFIER||1156|||NCRNA_7253|ncRNA|NON_CODING|NCRNA_7253||1),UPSTREAM(MODIFIER||1801||386|OS01G0234300|protein_coding|CODING|OS01T0234300-00||1),UPSTREAM(MODIFIER||2905|||NCRNA_9648|ncRNA|NON_CODING|NCRNA_9648||1),UPSTREAM(MODIFIER||2985||255|OS01G0234499|protein_coding|CODING|OS01T0234499-00||1),UPSTREAM(MODIFIER||4906|||NCRNA_27966|ncRNA|NON_CODING|NCRNA_27966||1),UPSTREAM(MODIFIER||4915|||NCRNA_33984|ncRNA|NON_CODING|NCRNA_33984||1) PL 55,1,0

You can use this awk:
awk 'FNR==NR {a[$1]; next} $2 in a' f1 f2
To redirect to another file:
awk 'FNR==NR {a[$1]; next} $2 in a' f1 f2 > f3
Explanation
FNR==NR {a[$1]; next} runs while the first file is being read: it stores each value from that file as a key of the array a[] and skips to the next line.
$2 in a is evaluated for the second file: if the 2nd column is a key of a[], the condition is true and the full line is printed.
Test
$ awk 'FNR==NR {a[$1]; next} $2 in a' f1 f2
chromosome01 7113528 . GACAC GAC 7.98 . INDEL;IS=1,0.333333;DP=3;VDB=6.186179e-02;AF1=1;AC1=2;DP4=1,1,0,1;MQ=35;FQ=-34.5;PV4=1,1,1,1;EFF=DOWNSTREAM(MODIFIER||2254||107|OS01G0228901|protein_coding|CODING|OS01T0228901-01||1),DOWNSTREAM(MODIFIER||3930|||NCRNA_20319|ncRNA|NON_CODING|NCRNA_20319||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER||3930|||NCRNA_21253|ncRNA|NON_CODING|NCRNA_21253||1),UPSTREAM(MODIFIER||4436||687|OS01G0228800|protein_coding|CODING|OS01T0228800-01||1) PL 43,0,0
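As a minimal sketch of the same idiom on throwaway data (the scratch directory and the shortened sample lines are assumptions made for illustration, not the asker's real files), the condition can also be negated to list the lines of the second file that have no match:

```shell
# Hypothetical scratch files; names and contents are made up for the demo.
tmp=$(mktemp -d)
printf '%s\n' 7113528 8680847 > "$tmp/f1"
printf '%s\n' \
  'chromosome01 6765006 . C T' \
  'chromosome01 7113528 . GACAC GAC' \
  'chromosome01 7113583 . C T' > "$tmp/f2"

# First file: remember every position as an array key.
# Second file: print lines whose 2nd field is one of those keys.
matched=$(awk 'FNR==NR {a[$1]; next} $2 in a' "$tmp/f1" "$tmp/f2")

# Negating the test prints the lines that have no match instead.
unmatched=$(awk 'FNR==NR {a[$1]; next} !($2 in a)' "$tmp/f1" "$tmp/f2")

printf 'matched:\n%s\nunmatched:\n%s\n' "$matched" "$unmatched"
rm -rf "$tmp"
```

One edge case to keep in mind: if the first file is empty, FNR==NR stays true while the second file is read, so nothing is printed at all.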

You can use grep:
grep -f file1 file2 > outputfile
The -f option tells grep to read the patterns from a file, one per line.
Note: Thanks to @fedorqui for pointing out that this can give false matches if one of the patterns in file1 appears in another column of file2.
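The caveat can be seen on a small made-up example (the files and the pattern 7113 are assumptions chosen to trigger the problem). Plain grep -f matches the pattern anywhere in the line, adding -w and -F narrows it to whole fixed-string words but still looks at every column, while the awk lookup only compares column 2:

```shell
# Hypothetical scratch files; contents are invented to show the false matches.
tmp=$(mktemp -d)
printf '%s\n' 7113 > "$tmp/file1"
printf '%s\n' \
  'chromosome01 7113528 . GACAC GAC' \
  'chromosome01 6765006 . C T DP=7113' > "$tmp/file2"

# Pattern matches as a substring of 7113528 and inside DP=7113: 2 lines.
loose=$(grep -c -f "$tmp/file1" "$tmp/file2")

# Whole-word, fixed-string matching drops 7113528 but keeps DP=7113: 1 line.
word=$(grep -c -w -F -f "$tmp/file1" "$tmp/file2")

# Exact comparison against column 2 only: 0 lines.
strict=$(awk 'FNR==NR {a[$1]; next} $2 in a {n++} END {print n+0}' \
  "$tmp/file1" "$tmp/file2")

printf 'grep -f: %s, grep -wFf: %s, awk: %s\n' "$loose" "$word" "$strict"
rm -rf "$tmp"
```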

Related

Sort a file preserving the header as first position with bash

When sorting a file, I am not preserving the header in its position:
file_1.tsv
Gene Number
a 3
u 7
b 9
sort -k1,1 file_1.tsv
Result:
a 3
b 9
Gene Number
u 7
So I am trying this code:
sed '1d' file_1.tsv | sort -k1,1 > file_1_sorted.tsv
first='head -1 file_1.tsv'
sed '1 "$first"' file_1_sorted.tsv
What I did was remove the header and sort the rest of the file, and then try to add the header back. But I cannot get this last part to work, so I would like to know how I can copy the header of the original file and insert it as the first row of the new file without replacing its actual first row.
You can do this as well :
{ head -1; sort; } < file_1.tsv
** Update **
For macos :
{ IFS= read -r header; printf '%s\n' "$header" ; sort; } < file_1.tsv
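If this is needed often, the read-based variant could be wrapped in a small function. This is only a sketch: the name sort_keep_header is hypothetical, and it assumes the header is exactly one line.

```shell
# sort_keep_header is a hypothetical helper name, not a standard command.
# It reads the first line itself, prints it, and lets sort consume the rest;
# any arguments are passed to sort unchanged.
sort_keep_header() {
  IFS= read -r header || return
  printf '%s\n' "$header"
  sort "$@"
}

# Demo on inline data shaped like file_1.tsv.
result=$(printf 'Gene\tNumber\na\t3\nu\t7\nb\t9\n' | sort_keep_header -k1,1)
printf '%s\n' "$result"
```

With a real file it would be called as sort_keep_header -k1,1 < file_1.tsv > file_1_sorted.tsv, which also writes the result to a new file as the question asked.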
a simpler awk
$ awk 'NR==1{print; next} {print | "sort"}' file
$ head -1 file; tail -n +2 file | sort
Output:
Gene Number
a 3
b 9
u 7
Could you please try following.
awk '
FNR==1{
first=$0
next
}
{
val=(val?val ORS:"")$0
}
END{
print first
print val | "sort"
}
' Input_file
Logical explanation:
Check the condition FNR==1 to see whether this is the first line; if so, save it in a variable and move on to the next line with next.
Then keep appending every following line to another variable, separated by newlines, until the last line.
Finally, the END block runs once Input_file has been read completely; it prints the saved first line and pipes the remaining lines through the sort command.
This will work using any awk, sort, and cut in any shell on every UNIX box and will work whether the input is coming from a pipe (when you can't read it twice) or from a file (when you can) and doesn't involve awk spawning a subshell:
awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2 | cut -f2-
The above uses awk to stick a 0 at the front of the header line and a 1 in front of the rest so you can sort by that number then whatever other field(s) you want to sort on and then remove the added field again with a cut. Here it is in stages:
$ awk -v OFS='\t' '{print (NR>1), $0}' file
0 Gene Number
1 a 3
1 u 7
1 b 9
$ awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2
0 Gene Number
1 a 3
1 b 9
1 u 7
$ awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2 | cut -f2-
Gene Number
a 3
b 9
u 7

How to join two files based on one column in AWK (using wildcards)

I have 2 files, and I need to compare column 2 from file 2 with column 3 from file 1.
File 1
"testserver1","testserver1.domain.net","-1.1.1.1-10.10.10.10-"
"testserver2","testserver2.domain.net","-2.2.2.2-20.20.20.20-200.200.200.200-"
"testserver3","testserver3.domain.net","-3.3.3.3-"
File 2
"windows","10.10.10.10","datacenter1"
"linux","2.2.2.2","datacenter2"
"aix","4.4.4.4","datacenter2"
Expected Output
"testserver1","testserver1.domain.net","windows","10.10.10.10","datacenter1"
"testserver2","testserver2.domain.net","linux","2.2.2.2","datacenter2"
All the statements I have been able to find only work if the columns are identical. I need it to work when column 3 of file 1 contains the value from column 2 of file 2.
I've tried this, but again, it only works if the columns are identical (which is not what I want):
awk 'BEGIN {FS = OFS = ","};NR == FNR{f[$3] = $1","$2;next};$2 in f {print f[$2],$0}' file1.csv file2.csv
hacky!
$ awk -F'","' 'NR==FNR {n=split($NF,x,"-"); for(i=2;i<n;i++) a[x[i]]=$1 FS $2; next}
$2 in a {print a[$2] "\"," $0}' file1 file2
"testserver1","testserver1.domain.net","windows","10.10.10.10","datacenter1"
"testserver2","testserver2.domain.net","linux","2.2.2.2","datacenter2"
assumes the lookup is unique, i.e. file1 records are mutually exclusive in that field.
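To see why the loop runs from i=2 to n-1, it may help to look at what split() produces for one file1 record. In this sketch (the shell variable names are assumptions), the first piece is empty because the field starts with "-", and the last piece is the leftover closing quote:

```shell
# One record from file1, fed to awk on its own.
line='"testserver2","testserver2.domain.net","-2.2.2.2-20.20.20.20-200.200.200.200-"'

pieces=$(printf '%s\n' "$line" |
  awk -F'","' '{
    n = split($NF, x, "-")      # x[1] is empty, x[n] is the trailing quote
    for (i = 2; i < n; i++)     # so only x[2] .. x[n-1] are addresses
      print x[i]
  }')
printf '%s\n' "$pieces"
```

Each printed address becomes a key in the lookup array, which is what lets the exact test $2 in a succeed for file2 even though column 3 of file1 holds several addresses.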

awk to skip lines up to and including pattern [duplicate]

I am trying to use awk to skip all lines including a specific pattern /^#CHROM/ and start processing on the line below. The awk does execute but currently returns all lines in the tab-delimited file. Thank you :).
file
##INFO=<ID=ANN,Number=1,Type=Integer,Description="My custom annotation">
##source_20170530.1=vcf-annotate(r953) -d key=INFO,ID=ANN,Number=1,Type=Integer,Description=My custom annotation -c CHROM,FROM,TO,INFO/ANN
##INFO=<ID=,Number=A,Type=Float,Description="Variant quality">
#CHROM POS ID REF ALT
chr1 948846 . T TA NA NA
chr2 948852 . T TA NA NA
chr3 948888 . T TA NA NA
awk
awk -F'\t' -v OFS="\t" 'NR>/^#CHROM/ {print $1,$2,$3,$4,$5,"ID=1"$6,"ID=2"$7}' file
desired output
chr1 948846 . T TA ID1=NA ID2=NA
chr2 948852 . T TA ID1=NA ID2=NA
chr3 948888 . T TA ID1=NA ID2=NA
awk 'BEGIN{FS=OFS="\t"} f{print $1,$2,$3,$4,$5,"ID1="$6,"ID2="$7} /^#CHROM/{f=1}' file
See https://stackoverflow.com/a/17914105/1745001 for details on this and other awk search idioms. Yours is a variant of "b" on that page.
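The order of the two rules is what decides whether the #CHROM line itself is processed. A small sketch on made-up input (the variable names are assumptions):

```shell
input=$(printf '%s\n' '##meta' '#CHROM POS' 'chr1 948846' 'chr2 948852')

# Print rule first, flag rule second: for the header line f is still unset
# when the print rule is tested, so output starts on the line after it.
after=$(printf '%s\n' "$input" | awk 'f{print} /^#CHROM/{f=1}')

# With the rules swapped, f is already set when the print rule is tested,
# so the header line itself is included.
from=$(printf '%s\n' "$input" | awk '/^#CHROM/{f=1} f{print}')

printf 'after the header:\n%s\nfrom the header:\n%s\n' "$after" "$from"
```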
Use the following awk approach:
awk -v OFS="\t" '/^#CHROM/{ r=NR }r && NR>r{ $6="ID1="$6; $7="ID2="$7; print }' file
The output:
chr1 948846 . T TA ID1=NA ID2=NA
chr2 948852 . T TA ID1=NA ID2=NA
chr3 948888 . T TA ID1=NA ID2=NA
/^#CHROM/{ r=NR } - captures the line number of the pattern line
The alternative approach would look as below:
awk -v OFS="\t" '/^#CHROM/{ f=1; next }f{ $6="ID1="$6; $7="ID2="$7; print }' file

How to find the difference between two files using multiple conditions?

I have two files file1.txt and file2.txt like below -
cat file1.txt
2016-07-20-22 4343250019 1003116 001 data45343 25-JUL-16 11-MAR-16 1 N 0 0 N
2016-06-20-22 654650018 1003116 001 data45343 25-JUL-17 11-MAR-16 1 N 0 0 N
cat file2.txt
2016-07-20-22|9|1003116|001|data45343|25-JUL-16 11-MAR-16|1|N|0|0|N|hello|table|one
2016-06-20-22|8|1003116|001|data45343|25-JUL-17 11-MAR-16|1|N|0|0|N|hi|this|kill
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
The requirement is to fetch the records of file2.txt that are not present in file1.txt, matching on the columns below.
file1.txt                    file2.txt
col1 (date)                  col1 (date)
col2 (number: 4343250019)    col2 (last digit of that number: 9)
col3 (number)                col3 (number)
col5 (alphanumeric)          col5 (alphanumeric)
Expected Output :
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|NULL|0|0|N|kill|boll|one
This output line is not present in file1.txt but is present in file2.txt when the matching criteria above are applied.
I was trying below steps to achieve this output -
###Replacing the space/tab from the file1.txt with pipe
awk '{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10}' OFS="|" file1.txt > file1.txt1
### Looping on a combination of four column of file1.txt1 with combination of modified column of file2.txt and output in output.txt
awk 'BEGIN{FS=OFS="|"} {a[$1FS$2FS$3FS$5];next} {(($1 FS substr($2,length($2),1) FS $3 FS $5) in a) print $0}' file2.txt file1.txt1 > output.txt
###And finally, replace the "N" from column 8th and put "NULL" if the value is "N".
awk -F'|' '{ gsub ("N","NULL",$8);print}' OFS="|" output.txt > output.txt1
What is the issue?
My 2nd operation is not working, and I would like to combine all 3 operations into one.
awk -F'[|]|[[:blank:]]+' 'FNR==NR{E[$1($2%10)$3$5]++;next}!($1$2$3$5 in E)' file1.txt file2.txt
and your sample output is wrong, it should be (last field if different: data45333)
2016-07-20-22|9|1003116|001|data45333|25-JUL-16 11-MAR-16|1|N|0|0|N|hello|table|one
2017-06-22-22|8|1003116|001|data45343|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
Commented code
# one field separator for both files: blanks for the first, `|` for the second
awk -F'[|]|[[:blank:]]+' '
# for the first file
FNR==NR{
# create an index entry from the 4 fields. Their format lets them be concatenated directly, without a separator, and still be unambiguous
E[ $1 ( $2 % 10 ) $3 $5 ]++
# nothing more to do for lines of this file
next
}
# for lines of the second file:
# if the key is not in the index, print the line (default action)
! ( ( $1 $2 $3 $5 ) in E ) { print }
' file1.txt file2.txt
Input
$ cat f1
2016-07-20-22 4343250019 1003116 001 data45343 25-JUL-16 11-MAR-16 1 N 0 0 N
2016-06-20-22 654650018 1003116 001 data45343 25-JUL-17 11-MAR-16 1 N 0 0 N
$ cat f2
2016-07-20-22|9|1003116|001|data45343|25-JUL-16 11-MAR-16|1|N|0|0|N|hello|table|one
2016-06-20-22|8|1003116|001|data45343|25-JUL-17 11-MAR-16|1|N|0|0|N|hi|this|kill
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
Output
$ awk 'FNR==NR{a[$1,substr($2,length($2)),$3,$5];next}!(($1,$2,$3,$5) in a)' f1 FS="|" f2
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
Explanation
awk ' # call awk.
FNR==NR{ # This is true when awk reads first file
a[$1,substr($2,length($2)),$3,$5] # array a where index being $1(field1), last char from $2, $3 and $5
next # stop processing go to next line
}
!(($1,$2,$3,$5) in a) # here we check index $1,$2,$3,$5 exists in array a by reading file f2
' f1 FS="|" f2 # Read f1 then
# set FS and then read f2
FNR==NR is true when the number of records read so far in the current file equals the number of records read so far across all files, which can only happen while the first file is being read.
a[$1,substr($2,length($2)),$3,$5] populates array a with an index built from the first field, the last character of the second field, the third field and the fifth field of the current record of file1.
next moves on to the next record, so none of the processing intended for records of the second file is done.
!(($1,$2,$3,$5) in a) is true if the index built from fields $1, $2, $3 and $5 of the current record of file2 does not exist in array a. (! is the logical NOT operator, which reverses the truth value of its operand.) When it is true, awk performs its default action and prints $0 from file2.
f1 FS="|" f2 reads file1 (f1), sets the field separator to "|" after the first file has been read, and then reads file2 (f2).
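The comma form of the subscript joins the parts with awk's SUBSEP character, which is what keeps multi-part keys unambiguous. A short sketch with invented keys shows the difference from plain concatenation, where different field combinations can collapse into the same string:

```shell
# Plain concatenation: "ab" "c" and "a" "bc" both become the key "abc".
collide=$(awk 'BEGIN {
  plain["ab" "c"]
  print (("a" "bc") in plain) ? "collision" : "distinct"
}')

# Comma form: the key is "ab" SUBSEP "c", which differs from "a" SUBSEP "bc".
safe=$(awk 'BEGIN {
  multi["ab", "c"]
  print (("a", "bc") in multi) ? "collision" : "distinct"
}')

printf 'concatenated keys: %s\ncomma keys: %s\n' "$collide" "$safe"
```

Concatenation without a separator is only safe when the format of the fields rules out such overlaps, which the other answer assumes for this data.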
--edit--
When the file is huge, around 60GB (900 million rows), it is not a good idea to process it twice. The 3rd operation (replacing "N" with "NULL" in column 8, i.e. awk -F'|' '{ gsub ("N","NULL",$8);print}' OFS="|" output.txt) can be done in the same pass:
$ awk 'FNR==NR{
a[$1,substr($2,length($2)),$3,$5];
next
}
!(($1,$2,$3,$5) in a){
sub(/N/,"NULL",$8);
print
}' f1 FS="|" OFS="|" f2
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|NULL|0|0|N|kill|boll|one
You can try this awk:
awk -F'[ |]+' 'NR==FNR{su=substr($2,length($2),1); a[$1":"su":"$3":"$5]=1;next} !a[$1":"$2":"$3":"$5]{print $0}' f1 f2
Here,
a[] - an associative array
$1":"su":"$3":"$5 - this forms the key for the array index. su is the last digit of field $2 (su=substr($2,length($2),1)). The value 1 is then assigned to this key.
NR==FNR{...;next} - this block works for processing f1.
Update:
awk 'NR==FNR{$2=substr($2,length($2),1); a[$1":"$2":"$3":"$5]=1;next} !a[$1":"$2":"$3":"$5]{gsub(/^N$/,"NULL",$8);print}' f1 FS="|" OFS='|' f2

Maintaining the separator in awk output

I would like to subset a file while keeping the separator in the subsetted output, using awk in bash.
This is what I am using:
The input file is created in R language with:
inp <- 'AX-1 1 125 AA 0.2 1 AB -0.89 0 AA 0.005 0.56
AX-2 2 456 AA 0 0 AA -0.56 0.56 AB -0.003 0
AX-3 3 3445 BB 1.2 1 NA 0.002 0 AA 0.005 0.55'
inp <- read.table(text=inp, header=F)
write.table(inp, "inp.txt", col.names=F, row.names=F, quote=F, sep="\t")
(So fields are separated by tabs)
The code in bash:
awk {'print $1 $2 $3'} inp.txt
The result:
AX-11125
AX-22456
AX-333445
Please note that my columns were merged in the awk output (and I would like it to be tab delimited like the input file). It is probably a simple syntax problem, but I would be grateful for any ideas.
Use
awk -v OFS='\t' '{ print $1, $2, $3 }'
or
awk '{ print $1 "\t" $2 "\t" $3 }'
Written one after another without an operator between them, variables in awk are concatenated - $1 $2 $3 is no different from $1$2$3 in this respect.
The first solution sets the output field separator OFS to a tab, then uses the comma operator to print separated fields. The second solution simply sprinkles tabs in there directly, and everything is concatenated as it was before.
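A quick way to compare the two behaviours is to run both forms on a single tab-separated row. The rows are made up for this sketch, and the third command is an extra assumption: it adds -F'\t' so that a field containing a space is not split on input.

```shell
# No comma between the fields: plain concatenation, the separator is lost.
concat=$(printf 'AX-1\t1\t125\tAA\n' | awk '{ print $1 $2 $3 }')

# Commas between the fields: print inserts OFS, here a tab.
tabbed=$(printf 'AX-1\t1\t125\tAA\n' | awk -v OFS='\t' '{ print $1, $2, $3 }')

# Setting the input separator too keeps "a b" together as one field.
swapped=$(printf 'a b\tc\n' | awk -F'\t' -v OFS='\t' '{ print $2, $1 }')

printf '%s\n%s\n%s\n' "$concat" "$tabbed" "$swapped"
```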