matching columns in separate files and appending matches to file - awk

I'm trying to merge two files by matching on a single column using awk. I'd then like to append the relevant columns from file2 to file1.
Easier to explain with dummy example.
File1
name fruit animal
bob apple dog
jim orange cat
gary mango snake
daisy peach mouse
File 2:
animal number shape
cat eight square
dog nine circle
mouse eleven sphere
Desired output:
name fruit animal shape
bob apple dog circle
jim orange cat square
gary mango snake NA
daisy peach mouse sphere
Step 1: Need to filter on column 3 in file1 and column 1 in file2
awk -F'\t' 'NR==FNR{c[$3]++;next};c[$1] > 0' file1 file2
This gives me output:
cat eight square
dog nine circle
mouse eleven sphere
This helps me somewhat; however, I can't simply cut the third column (shape) from the output above and append it to file1, since there is no entry for 'snake' in file2. I need to append column 3 of the output to file1 where a match is successful, and put 'NA' where it is not. It's essential that all the lines in file1 are retained, so I can't just omit them. This is where I'm stuck!
I'd appreciate any help please....

Could you please try following, written and tested based on shown samples in GNU awk.
awk '
BEGIN{
OFS="\t"
}
FNR==NR{
a[$1]=$NF
next
}
{
print $0,($3 in a?a[$3]:"NA")
}' Input_file2 Input_file1
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
OFS="\t" ##Setting TAB as output field separator here.
}
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first Input_file file2 is being read.
a[$1]=$NF ##Creating array a with index $1 and value is $NF for current line.
next ##next will skip all further statements from here.
}
{
print $0,($3 in a?a[$3]:"NA") ##Printing current line and checking if 3rd field is present in array a then print its value OR print NA.
}' file2 file1 ##Mentioning Input_file names here.
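To check the approach against the sample data, here is a minimal self-contained sketch. It assumes the real files are tab-delimited (the question's own attempt uses -F'\t'), so it sets both FS and OFS to a tab. The temporary directory and the array name shape are illustrative choices, not part of the answer above.

```shell
# Recreate the sample files in a temporary directory (tab-delimited by assumption).
tmp=$(mktemp -d)
printf 'name\tfruit\tanimal\nbob\tapple\tdog\njim\torange\tcat\ngary\tmango\tsnake\ndaisy\tpeach\tmouse\n' > "$tmp/file1"
printf 'animal\tnumber\tshape\ncat\teight\tsquare\ndog\tnine\tcircle\nmouse\televen\tsphere\n' > "$tmp/file2"

# Read file2 first to build the lookup, then append the shape (or NA) to every line of file1.
merged=$(awk '
BEGIN { FS = OFS = "\t" }
FNR == NR { shape[$1] = $3; next }              # file2: key on animal, store shape
{ print $0, ($3 in shape ? shape[$3] : "NA") }  # file1: look up column 3
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$merged"
rm -rf "$tmp"
```

Because the header row of file2 is stored like any other row, the header of file1 picks up "shape" without special handling.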

Related

Printing lines of file2 when two fields from file1 match substrings of a single field in file2

Goal: To print lines of File2 when field 1 ($1) and field 4 ($4) of File1 both match a substring in field 4 ($4) on lines beginning with ">" in File2.
Important note #1: The lines being printed to output include the line being searched and all the lines following it until the next line with a ">".
Example: When fields 1 and 4 of File1 are 2776 & 2968 respectively, these should be searched against field 4 of File2 to eventually find the match 2776-2968(+) (because both numbers of File1 match a substring in field 4 of File2). The order of the numbers in the string does not matter - 2968-2776(+) should also be considered a match. Since they match, that line of File2 is printed with all lines below it until another line with ">" is encountered.
Important Note #2: File1 is tab-delimited: \t. File 2 is colon-delimited: :.
File1:
Transcription_Start Translation_Start Translation_Stop Transcription_Stop Strand Expression
2776 2968 + 920
17374 17563 + 1959
2968 2786 - 802
17563 17375 - 1694
19606 19395 - 1914
File2:
>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC
>antisense_CDR20291_r27::NC_013316.1:10830-11707(-)
TATTTCTTGTTCCTTTTTTCAAGGACAAGTAAATAAATTAACCTACTGTTTAATTTTCAA
>antisense_CDR20291_r27::NC_013316.1:11814-11874(-)
TTCCTTTGAGTTTCACTCTTGCGAGCGTACTTCCCAGGCGGA
Desired Output:
>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC
This is what I've tried so far (it outputs the full contents of File2, thus failing to produce the desired output):
$ awk -F"\t|:" 'NR==FNR{a[$4]; next} ($1 in a) || ($4 in a)' File1 File2 > Output
>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC
>antisense_CDR20291_r27::NC_013316.1:10830-11707(-)
TATTTCTTGTTCCTTTTTTCAAGGACAAGTAAATAAATTAACCTACTGTTTAATTTTCAA
>antisense_CDR20291_r27::NC_013316.1:11814-11874(-)
TTCCTTTGAGTTTCACTCTTGCGAGCGTACTTCCCAGGCGGA
How can I process my files with awk (or similar) to achieve my goal?
With your shown samples, please try following. Written and tested with GNU awk.
awk '
FNR==NR{
arr[$1,$2]
next
}
/^>/{
found=""
if((($5,$6) in arr) || (($6,$5) in arr)){
found=1
}
}
found
' file1 FS=":|-|\\\\(" file2
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when file1 is being read.
arr[$1,$2] ##Creating arr with index of 1st and 2nd field.
next ##next will skip all further statements from here.
}
/^>/{ ##Checking condition if line starts from > then do following.
found="" ##Nullifying found here.
if((($5,$6) in arr) || (($6,$5) in arr)){ ##Checking condition if either 5th 6th field is present in arr OR 6th 5th field as a key present in arr then do following.
found=1 ##Setting found to 1 here.
}
}
found ##Checking condition if found is set then print that line.
' file1 FS=":|-|\\\\(" file2 ##Mentioning Input_file(s) and setting field separator before Input_file2 to get exact values to match.
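The field splitting is the part that is easiest to get wrong, so a small sketch may help. It swaps the escaped FS for an equivalent bracket expression, [-:(], which avoids the backslash quoting; this is a substitution for illustration, not the answer's exact separator. The file contents are shortened copies of the samples and the temporary paths are illustrative.

```shell
tmp=$(mktemp -d)
# Shortened sample data: coordinates as in the question, sequences truncated.
printf '%s\n' '2776 2968 + 920' '17374 17563 + 1959' '2968 2786 - 802' > "$tmp/file1"
printf '%s\n' \
  '>-::NC_013316.1:2776-2968(+)' 'ATTG' \
  '>-::NC_013316.1:2786-2968(-)' 'GTTC' \
  '>antisense_CDR20291_r27::NC_013316.1:10830-11707(-)' 'TATT' > "$tmp/file2"

# Splitting a header on "-", ":" or "(" puts the two coordinates in $5 and $6.
coords=$(printf '%s\n' '>-::NC_013316.1:2776-2968(+)' | awk -F'[-:(]' '{ print $5, $6 }')

# Same logic as the answer: keep a header and its sequence lines
# when the coordinate pair from file1 matches in either order.
matched=$(awk '
FNR == NR { arr[$1, $2]; next }
/^>/ {
  found = 0
  if ((($5, $6) in arr) || (($6, $5) in arr)) found = 1
}
found
' "$tmp/file1" FS='[-:(]' "$tmp/file2")

printf '%s\n' "$coords" "$matched"
rm -rf "$tmp"
```

The antisense header is dropped because its leading field has no "-", which shifts the coordinates out of $5 and $6, so neither pair is found in arr.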

Not extracting data between two patterns

I have tried this awk command, but for some reason it is not printing out the data between two patterns
This is my entire awk command
for file in `cat out.txt`
do
awk -v ff="$file" 'BEGIN {print "Start Parsing for"ff} /ff-START/{flag=1; next}/ff-END/{flag=0}flag; END{print "End Parsing"ff}' data.txt
done
This is the content of data.txt
JOHN SMITH-START
Device,Number
TV,1
Washing Machine,1
Phones, 5
JOHN SMITH-END
MARY JOE-START
Device,Number
TV,3
Washing Machine,1
Phones, 2
MARY JOE-END
There are about 100 more similar blocks, where the patterns are NAME-START and NAME-END. For example, JOHN SMITH-START is the first pattern and JOHN SMITH-END is the second pattern, and I want to extract the data between these two, which is
Device,Number
TV,1
Washing Machine,1
Phones, 5
But the output I get is
Start Parsing forJOHN SMITH
End ParsingJOHN SMITH
Content of out.txt is
JOHN SMITH
MARY JOE
With your shown samples, could you please try following.
awk '/JOHN SMITH-END/ && found{exit} /JOHN SMITH-START/{found=1;next} found' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
/JOHN SMITH-END/ && found{ ##Checking condition if line contains JOHN SMITH-END and found is SET then do following.
exit ##exiting from program from here.
}
/JOHN SMITH-START/{ ##Checking condition if line contains JOHN SMITH-START then do following.
found=1 ##Setting found to 1 here.
next ##next will skip all further statements from here.
}
found ##If found is set then print that line.
' Input_file ##Mentioning Input_file name here.
NOTE: In case you want to use variables in awk for search then try following:
awk -v start="JOHN SMITH-START" -v end="JOHN SMITH-END" '$0 ~ end && found{exit} $0 ~ start{found=1;next} found' Input_file
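The loop in the question prints nothing between the banners because awk treats the text inside /.../ literally: /ff-START/ looks for the characters "ff-START", not for the value of the variable ff. A minimal sketch of the loop using variables is below. It compares whole lines as strings rather than as regular expressions (a deliberate substitution, so names containing regex metacharacters would still work), and the helper name extract_block and the shortened data are assumptions made for the example.

```shell
tmp=$(mktemp -d)
# Shortened copies of data.txt and out.txt from the question.
printf '%s\n' 'JOHN SMITH-START' 'Device,Number' 'TV,1' 'JOHN SMITH-END' \
              'MARY JOE-START' 'Device,Number' 'TV,3' 'MARY JOE-END' > "$tmp/data.txt"
printf '%s\n' 'JOHN SMITH' 'MARY JOE' > "$tmp/out.txt"

# extract_block NAME FILE: print the lines between NAME-START and NAME-END
# (hypothetical helper, not part of the original answer).
extract_block() {
  awk -v name="$1" '
    $0 == (name "-END") && found { exit }
    $0 == (name "-START")        { found = 1; next }
    found
  ' "$2"
}

# Read one name per line so that names containing spaces stay intact.
report=$(while IFS= read -r name; do
  echo "Start Parsing for $name"
  extract_block "$name" "$tmp/data.txt"
  echo "End Parsing $name"
done < "$tmp/out.txt")

printf '%s\n' "$report"
rm -rf "$tmp"
```

Reading out.txt with while read also avoids the word splitting that a for loop over `cat out.txt` applies, which would break "JOHN SMITH" into two names.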

shell awk script to remove duplicate lines

I am trying to remove duplicate lines from a file including the original ones but the following command that I am trying is sorting the lines and I want them to be in the same order as they are in input file.
awk '{++a[$0]}END{for(i in a) if (a[i]==1) print i}' test.txt
Input:
123
aaa
456
123
aaa
888
bbb
Output I want:
456
888
bbb
Simpler code if you are okay with reading input file twice:
$ awk 'NR==FNR{a[$0]++; next} a[$0]==1' ip.txt ip.txt
456
888
bbb
With single pass:
$ awk '{a[NR]=$0; b[$0]++} END{for(i=1;i<=NR;i++) if(b[a[i]]==1) print a[i]}' ip.txt
456
888
bbb
If you want to do this in awk only then could you please try following; if not worried about order.
awk '{a[$0]++};END{for(i in a){if(a[i]==1){print i}}}' Input_file
To get the unique values in same order in which they occur in Input_file try following.
awk '
!a[$0]++{
b[++count]=$0
}
{
c[$0]++
}
END{
for(i=1;i<=count;i++){
if(c[b[i]]==1){
print b[i]
}
}
}
' Input_file
Output will be as follows.
456
888
bbb
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
!a[$0]++{ ##Checking condition if current line is seen for the first time (not yet counted in array a) then do following.
b[++count]=$0 ##Creating an array b with index count whose value is increasing with 1 and its value is current line value.
}
{
c[$0]++ ##Creating an array c whose index is current line and its value is occurrence of current lines.
}
END{ ##Starting END block for this awk program here.
for(i=1;i<=count;i++){ ##Starting for loop from here.
if(c[b[i]]==1){ ##Checking condition if value of array c with index is value of array b with index i equals to 1 then do following.
print b[i] ##Printing value of array b.
}
}
}
' Input_file ##Mentioning Input_file name here.
awk '{ b[$0]++; a[n++]=$0; }END{ for (i=0;i<n;i++){ if(b[a[i]]==1) print a[i] }}' input
Lines are added to array b, the order of lines is kept in array a.
If, in the end, the count is 1, the line is printed.
Sorry, I misread the question at first, and I corrected the answer to be almost the same as @Sundeep's.
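A minimal sketch of the single-pass idea on the sample input follows. It uses a numeric loop in the END block because the traversal order of for (i in array) is unspecified in awk; the array names count and line are illustrative.

```shell
# Sample input from the question, fed through a pipe instead of a file.
uniques=$(printf '%s\n' 123 aaa 456 123 aaa 888 bbb | awk '
{ count[$0]++; line[NR] = $0 }   # count occurrences and remember input order
END {
  for (i = 1; i <= NR; i++)      # numeric loop walks lines in input order
    if (count[line[i]] == 1) print line[i]
}')

printf '%s\n' "$uniques"
```

This keeps every input line in memory, so for very large files the two-pass form that reads the file twice may be the better fit.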

bash compare two columns with exact match

I am comparing columns between two files for exact match but I am ending up with inaccurate result. Example as follows.
File1 File2
adam sunny
jhon adam
kelly adam
matt kevin
stuart adam
Gary Gary
When we look at the files, there is only one match, i.e. Gary. My output should be the following.
Emptyline
Emptyline
Emptyline
Emptyline
Emptyline
Gary
To achieve this, I am running the following command
awk 'NR==FNR { n[$1]=$0;next } ($1 in n) { print n[$1],$2 }' file1 file2
and I am getting output as follows
adam
adam
adam
Gary
You should be tracking line numbers, not just line contents:
$ awk 'NR==FNR { lines[NR]=$0; next }
{ if ($0 == lines[FNR]) print; else print "" }' file1.txt file2.txt
Gary
1st solution: With simple awk.
awk 'FNR==NR{a[FNR]=$0;next} a[FNR]==$0{print;next} {print ""}' file1 file2
OR as per anubhava sir's comment:
awk 'FNR==NR{a[FNR]=$0;next} a[FNR]!=$0{$0=""} 1' file1 file2
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first file Input_file1 is being read.
a[FNR]=$0 ##Creating an array a with index FNR and value of current line here.
next ##next will skip all further statements from here.
}
a[FNR]==$0{ ##Checking condition if value of array a with FNR index and current line are equal then do following.
print ##Printing current line here.
next ##next will skip all further statements from here.
}
{
print "" ##Printing an empty line when the lines do not match.
}
' file1 file2 ##Mentioning Input_file names here
2nd solution: Considering that your actual Input_file(s) have only 2 columns as per shown samples, could you please try following then.
paste Input_file1 Input_file2 | awk '$1==$2{print $1};$1!=$2{print ""}'
This code prints the value when both columns are equal on a line, and an empty line otherwise.
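As a quick check of the line-by-line comparison, here is a minimal sketch that rebuilds the two sample files and runs the same logic as the 1st solution. The temporary paths and the variable name same are assumptions for the example.

```shell
tmp=$(mktemp -d)
printf '%s\n' adam jhon kelly matt stuart Gary > "$tmp/file1"
printf '%s\n' sunny adam adam kevin adam Gary > "$tmp/file2"

# Store file1 by line number, then blank out every line of file2
# that differs from the line at the same position in file1.
same=$(awk 'FNR == NR { a[FNR] = $0; next } a[FNR] != $0 { $0 = "" } 1' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$same"
rm -rf "$tmp"
```

Keying the array on FNR rather than on the line contents is what stops "adam" from matching: it appears in both files, but never on the same line number.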

Understanding two file processing in awk

I am trying to understand how two-file processing works, so I created an example here.
file1.txt
zzz pq Fruit Apple 10
zzz rs Fruit Car 50
zzz tu Study Book 60
file2.txt
aa bb Book 100
cc dd Car 200
hj kl XYZ 500
ee ff Apple 300
ff gh ABC 400
I want to compare 4th column of file1 to 3rd column of file2, if matched then print the 3rd,4th,5th column of file1 followed by 3rd, 4th column of file2 with sum of 5th column of file1 and 4th column of file2.
Expected Output:
Fruit Apple 10 300 310
Fruit Car 50 200 250
Study Book 60 100 160
Here what I have tried:
awk ' FNR==NR{ a[$4]=$5;next} ( $3 in a){ print $3, a[$4],$4}' file1.txt file2.txt
Code output;
Book 100
Car 200
Apple 300
I am facing problem in printing file1 column and how to store the other column of file1 in array a. Please guide me.
Could you please try following.
awk 'FNR==NR{a[$4]=$3 OFS $4 OFS $5;b[$4]=$NF;next} ($3 in a){print a[$3],$NF,b[$3]+$NF}' file1.txt file2.txt
Output will be as follows.
Study Book 60 100 160
Fruit Car 50 200 250
Fruit Apple 10 300 310
Explanation: Adding explanation for above code now.
awk ' ##Starting awk program here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first Input_file named file1.txt is being read.
a[$4]=$3 OFS $4 OFS $5 ##Creating an array named a whose index is $4 and value is 3rd, 4th and 5th fields along with spaces(By default OFS value will be space for awk).
b[$4]=$NF ##Creating an array named b whose index is $4 and value is $NF(last field of current line).
next ##next keyword will skip all further statements from here.
}
($3 in a){ ##Checking if 3rd field of current line(from file2.txt) is present in array a then do following.
print a[$3],$NF,b[$3]+$NF ##Printing array a whose index is $3, last column value of current line and then SUM of array b with index $3 and last column value here.
}
' file1.txt file2.txt ##Mentioning Input_file names file1.txt and file2.txt
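The command above prints matches in the order they appear in file2.txt. If the order of file1.txt matters, as in the expected output, one possible variation is to read file2.txt first and drive the output from file1.txt. This is a sketch under that assumption; the array name val and the temporary paths are illustrative.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'zzz pq Fruit Apple 10' 'zzz rs Fruit Car 50' 'zzz tu Study Book 60' > "$tmp/file1.txt"
printf '%s\n' 'aa bb Book 100' 'cc dd Car 200' 'hj kl XYZ 500' 'ee ff Apple 300' 'ff gh ABC 400' > "$tmp/file2.txt"

joined=$(awk '
FNR == NR { val[$3] = $4; next }                         # file2.txt: item name to amount
($4 in val) { print $3, $4, $5, val[$4], $5 + val[$4] }  # file1.txt: append amount and sum
' "$tmp/file2.txt" "$tmp/file1.txt")

printf '%s\n' "$joined"
rm -rf "$tmp"
```

Only one array is needed here because the file1.txt fields are still available in the current record when the line is printed.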