Printing lines of file2 when two fields from file1 match substrings of a single field in file2 - awk

Goal: To print lines of File2 when field 1 ($1) and field 4 ($4) of File1 both match a substring in field 4 ($4) on lines beginning with ">" in File2.
Important note #1: The lines being printed to output include the line being searched and all the lines following it until the next line with a ">".
Example: When fields 1 and 4 of File1 are 2776 & 2968 respectively, these should be searched against field 4 of File2 to evntually find the match 2776-2968(+) (because both numbers of File1 match a substring in field 4 of File2). The order of the numbers in the string does not matter - 2968-2776(+) should also be considered a match. Since they match, that line of File2 is printed with all lines below it until another line with ">" is encountered.
Important Note #2: File1 is tab-delimited: \t. File 2 is colon-delimited: :.
File1:
Transcription_Start Translation_Start Translation_Stop Transcription_Stop Strand Expression
2776 2968 + 920
17374 17563 + 1959
2968 2786 - 802
17563 17375 - 1694
19606 19395 - 1914
File2:
>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC
>antisense_CDR20291_r27::NC_013316.1:10830-11707(-)
TATTTCTTGTTCCTTTTTTCAAGGACAAGTAAATAAATTAACCTACTGTTTAATTTTCAA
>antisense_CDR20291_r27::NC_013316.1:11814-11874(-)
TTCCTTTGAGTTTCACTCTTGCGAGCGTACTTCCCAGGCGGA
Desired Output:
>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC
This is what I've tried so far (it outputs the full contents of File2, thus failing to produce the desired output):
$ awk -F"\t|:" 'NR==FNR{a[$4]; next} ($1 in a) || ($4 in a)' File1 File2 > Output
>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC
>antisense_CDR20291_r27::NC_013316.1:10830-11707(-)
TATTTCTTGTTCCTTTTTTCAAGGACAAGTAAATAAATTAACCTACTGTTTAATTTTCAA
>antisense_CDR20291_r27::NC_013316.1:11814-11874(-)
TTCCTTTGAGTTTCACTCTTGCGAGCGTACTTCCCAGGCGGA
How can I process my files with awk (or similar) to achieve my goal?

With your shown samples, please try following. Written and tested with GNU awk.
awk '
FNR==NR{
arr[$1,$2]
next
}
/^>/{
found=""
if((($5,$6) in arr) || (($6,$5) in arr)){
found=1
}
}
found
' file1 FS=":|-|\\\\(" file2
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when file1 is being read.
arr[$1,$2] ##Creating arr with index of 1st and 2nd field.
next ##next will skip all further statements from here.
}
/^>/{ ##Checking condition if line starts from > then do following.
found="" ##Nullifying found here.
if((($5,$6) in arr) || (($6,$5) in arr)){ ##Checking condition if either 5th 6th field is present in arr OR 6th 5th field as a key present in arr then do following.
found=1 ##Setting found to 1 here.
}
}
found ##Checking condition if found is set then print that line.
' file1 FS=":|-|\\\\(" file2 ##Mentioning Input_file(s) and setting field separator before Input_file2 to get exact values to match.

Related

How to count values between empty cells

I'm facing one problem which is bigger than me. I have 18 relative large text files (ca 30k lines each) and I need to count the values between the empty cells in the second column. Here is a simple example of my file:
Metabolism
line_1 10.2
line_2 10.1
line_3 10.3
TCA_cycle
line_4 10.7
line_5 10.8
Pyruvate_metab
line_6 100.8
In reality, I have circa 500 description lines (Metabolism, TCA_cycle, etc.) and the range of lines is between zero to a few hundred.
I would like to count values for each block (block starts with a description and corresponding lines are always below), e.g.
Metabolism 30.6
line_1 10.2
line_2 10.1
line_3 10.3
TCA_cycle 21.5
line_4 10.7
line_5 10.8
Pyruvate_metab 100.8
line_6 100.8
Or just
30.3
21.5
100.8
It won't be a problem if results will be printed line by line into an additional file... Or another alternative way.
There is one tricky thing and it's descriptions without lines with numbers.
Transport
line_1000 100.1
line_1001 100.2
Cell_signal
Motility
Processing
Translation
line_1002 500.1
line_1003 200.2
And even for those lines and would like to get 0 value.
Transport 200.3
line_1000 100.1
line_1001 100.2
Cell_signal 0
Motility 0
Processing 0
Translation 700.3
line_1002 500.1
line_1003 200.2
The rest of the file looks same and it's consistent - 2 columns, tab separators, descriptions in the first column, values in the second, no spaces (only underlines).
Actually I have no experience with more sophisticated coding so I really don't know how to solve it in the command line. I've already tried some Excel ways but it was painful and unsuccessful.
With tac and any awk:
tac file | awk 'NF==2{sum+=$2; print; next} {print $1 "\t" sum; sum=0}' | tac
With two improvements proposed by kvantour and Ed Morton. See the comments.
tac file | awk '($NF+0==$NF){sum+=$2; print; next} {print $1 "\t" sum+0; sum=0}' | tac
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
Could you please try following, written and tested with shown samples in GNU awk.
awk '
FNR==NR{
if($0!~/line/){ a[$0]; prev=$0 }
else { a[prev]+=$NF }
next
}
!/line/{
$0=$0 OFS (a[$0]?a[$0]:0)
}
1' Input_file Input_file
OR in case you want output in good looking form add column -t to above command like as follows:
awk '
FNR==NR{
if($0!~/line/){ a[$0]; prev=$0 }
else { a[prev]+=$NF }
next
}
!/line/{
$0=$0 OFS (a[$0]?a[$0]:0)
}
1' Input_file Input_file | column -t
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking FNR==NR which will be TRUE when Input_file is being read first time.
if($0!~/line/){ a[$0]; prev=$0 } ##checking condition if line contains string line and setting index of current line in a and setting prev value to current line.
else { a[prev]+=$NF } ##Else if line not starting from line then creating array a with index prev variable and keep on adding last field value to same index of array.
next ##next will skip all further statements from here.
}
!/line/{ ##Checking if current line doesnot have line keyword in it then do following.
$0=$0 OFS (a[$0]?a[$0]:0) ##Re-creating current line with its current value then OFS(which is space by default) then either add value of a[$0] or 0 based on current line value is NOT NULL here.
}
1 ##Printing current line here.
' Input_file Input_file ##Mentioning Input_file names here.
In plain awk:
awk '{
if (NF == 1) {
if (blockname)
printf("%s\t%.2f\n%s", blockname, sum, lines)
blockname = $0
sum = 0
lines=""
} else if (NF == 2) {
sum += $2
lines = lines $0 "\n"
}
next
}
END { printf("%s\t%.2f\n%s", blockname, sum, lines) }
' input.txt

AWK print line from File A if string in 2 columns are both present in File B

I am trying to pull out lines where Col1 and Col2 from FileA are present in FileB.
For example,
FileA:
1000963 4852419 0.051 0.0103 0.1126
1001037 1957033 0.044 0.0154 0.0473
1001107 1690854 0.045 0.0145 0.0612
1001176 1996721 0.067 0 0.2494
FileB:
1281525
1000963
1690854
1001176
1001037
1957033
1996721
5784681
In the example above, I would expect the output to be:
1001037 1957033 0.044 0.0154 0.0473
1001176 1996721 0.067 0 0.2494
Note that the other two lines were not pulled out because only a string in one column (not both columns) was present in FileB.
Is there a way of doing this in awk? My attempts have not worked so far.
Thank you!
Could you please try following.
awk 'FNR==NR{a[$0];next} (($1 in a) && ($2 in a))' Input_file2 Input_file1
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when Input_file2 is being read.
a[$0] ##Creating array a with index $0 here.
next ##next will skip all further statements from here.
}
(($1 in a) && ($2 in a)) ##Checking if current line 1st and 2nd field both are present in array then print current line.
' Input_file2 Input_file1 ##Mentioning Input_file names here.

How to join two CSV files by a temporary common column in awk?

I have two CSV files in the form of
file1
A,44
A,21
B,65
C,79
file2
A,7
B,4
C,11
I used awk as
awk -F, 'NR==FNR{a[$1]=$0;next} ($1 in a){print a[$1]","$2 }' file1.csv file2.csv
producing
A,44,7
A,21,7
B,65,4
C,79,11
a[$1] prints the entire line from file1. How can I omit the first columns in both files (the first column is only used to match the second columns) to produce:
44,7
21,7
65,4
79,11
In other words, how can I pass the columns from the first file to the print block, as $2 does for the second file?
Could you please try following, tested and written on shown samples only.
awk 'BEGIN{FS=OFS=","} FNR==NR{a[$1]=$2;next} ($1 in a){print $2,a[$1]}' file2 file1
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=OFS="," ##Setting field and output field separator as comma here.
}
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when file2 is being read.
a[$1]=$2 ##Creating array a with index $1 and value is $2 from current line.
next ##next will skip all further statement from here.
}
($1 in a){ ##Statements from here will be executed when file1 is being read and it's checking if $1 is present in array a then do following.
print $2,a[$1] ##Printing 2nd field and value of array a with index $1 here.
}
' file2 file1 ##Mentioning Input_file names here.
Output will be as follows for shown samples.
44,7
21,7
65,4
79,11
2nd solution: More Generic solution, where considering that your both Input_files could have duplicates in that case it will print 1st value of A in Input_file1 to first value of Input_file2 and so on.
awk '
BEGIN{
FS=OFS=","
}
FNR==NR{
a[$1]
b[$1,++c[$1]]=$2
next
}
($1 in a){
print $2,b[$1,++d[$1]]
}
' file2 file1
You can join them using the join command and chose which fields you want to have in the output:
kent$ join -t',' -o 1.2,2.2 file1 file2
44,7
21,7
65,4
79,11

bash compare two columns with exact match

I am comparing columns between two files for exact match but I am ending up with inaccurate result. Example as follows.
File1 File2
adam sunny
jhon adam
kelly adam
matt kevin
stuart adam
Gary Gary
When we look at the files there is only match i.e. Garry. My output should be following.
Emptyline
Emptyline
Emptyline
Emptyline
Emptyline
Gary
In order to achieve requirement. I am running the following command
awk 'NR==FNR { n[$1]=$0;next } ($1 in n) { print n[$1],$2 }' file1 file2
and I am getting output as follows
adam
adam
adam
Garry
You should be tracking line numbers, not just line contents:
$ awk 'NR==FNR { lines[NR]=$0; next }
{ if ($0 == lines[FNR]) print; else print "" }' file1.txt file2.txt
Gary
1st solution: With simple awk.
awk 'FNR==NR{a[FNR]=$0;next} a[FNR]==$0{print;next} {print ""}' file1 file2
OR as per anubhava sir's comment:
awk 'FNR==NR{a[FNR]=$0;next} a[FNR]!=$0{$0=""} 1' file1 file2
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first file Input_file1 is being read.
a[FNR]=$0 ##Creating an array a with index FNR and value of current line here.
next ##next will skip all further statements from here.
}
a[FNR]==$0{ ##Checking condition if value of array a with FNR index and current line is equal then do following.
print $0,a[FNR] ##Printing current line and value array a with index FNR here.
}
' file1 file2 ##Mentioning Input_file names here
2nd solution: Considering that your actual Input_file(s) have only 2 columns as per shown samples, could you please try following then.
paste Input_file1 Input_file2 | awk '$1==$2{print $1};$1!=$2{print ""}'
This code will only print lines whose values are equal in Input_file1 and Input_file2.

Comparing Only First 4 columns from File1 (CSV Formatted ) to First 4 columns in File2 and printing all columns from File1 i

I'm Trying to compare two Files in CSV format
First File is dynamic , daily the columns will be added ,
on second File it has only 4 columns ( static )
comparing First 4 columns on File1 to 4 columns on File2 and printing all columns from File1 which matches from File2
Ex :
File1
AIXTSM1,VHOST,10.199.114.72,DAILY_1800_VM_SDC-CTL-PROD3,COMP,COMP,COMP
AIXTSM1,VHOST,ADMET007,DAILY_1800_VM_SDC-CTL-PROD3,COMP,COMP,COMP
AIXTSM2,VHOST,ADMET014,DAILY_1900_VM_UDC-CTL-PROD,COMP,COMP,COMP
AIXTSM1,VHOST,AGGREGATE,DAILY_2200_VM_SDC-CTL-PROD5,COMP,COMP,COMP
AIXTSM1,PHOST,APLEE01,DAILY_2000_SU_W3,COMP,COMP,COMP
AIXTSM1,PHOST,APYRK02,DAILY_2000_SU_W3,COMP,COMP,COMP
AIXTSM2,PHOST,APYRK04,DAILY_1800_V7K,COMP,COMP,COMP
AIXTSM1,VHOST,ARCLIC01,DAILY_2200_VM_SDC-CTL-PROD5,COMP,COMP,COMP
AIXTSM2,PHOST,ARIELN,DAILY_1800_V7K,COMP,COMP,COMP
AIXTSM2,VHOST,ASMET005,DAILY_1900_VM_UDC-CTL-PROD,COMP,COMP,COMP
AIXTSM2,PHOST,ASMET014,WIN_INCRE_2000,COMP,COMP,COMP
AIXTSM1,VHOST,ASMET038,DAILY_1800_VM_SDC-CTL-PROD2,COMP,COMP,COMP
AIXTSM2,VHOST,ASMET042,DAILY_1900_VM_UDC-CTL-PROD,COMP,COMP,COMP
AIXTSM1,VHOST,ASMET044,DAILY_1800_VM_SDC-CTL-PROD3,COMP,COMP,COMP
AIXTSM2,VHOST,ASMET046,DAILY_1900_VM_UDC-CTL-PROD,COMP,COMP,COMP
AIXTSM2,VHOST,ASMET068,DAILY_1900_VM_UDC-CTL-PROD,COMP,COMP,COMP
AIXTSM2,VHOST,ASMET069,DAILY_1900_VM_UDC-CTL-PROD,COMP,COMP,COMP
AIXTSM2,VHOST,ASMET070,DAILY_1900_VM_UDC-CTL-PROD,COMP,COMP,COMP
AIXTSM2,VHOST,ASMET071,DAILY_1900_VM_UDC-CTL-PROD,COMP,COMP,COMP
AIXTSM2,VHOST,ASMET072,DAILY_1900_VM_UDC-CTL-PROD,COMP,COMP,COMP
AIXTSM2,VHOST,ASMET073,DAILY_1900_VM_UDC-CTL-PROD,COMP,COMP,COMP
AIXTSM1,PHOST,ASMET074,DAILY_INCR_1900,COMP,COMP,COMP
AIXTSM1,VHOST,ASMET084-T,DAILY_1800_VM_SDC-CTL-PROD3,COMP,COMP,COMP
File2
AIXTSM1,VHOST,10.199.114.72,DAILY_1800_VM_SDC-CTL-PROD
AIXTSM1,VHOST,ADMET007,DAILY_1800_VM_SDC-CTL-PROD3
AIXTSM2,VHOST,ADMET014,DAILY_1900_VM_UDC-CTL-PROD
AIXTSM1,VHOST,AGGREGATE,DAILY_2200_VM_SDC-CTL-PROD5
Result
AIXTSM1,VHOST,10.199.114.72,DAILY_1800_VM_SDC-CTL-PROD3,COMP,COMP,COMP,OK
AIXTSM1,VHOST,ADMET007,DAILY_1800_VM_SDC-CTL-PROD3,COMP,COMP,COMP,OK
AIXTSM2,VHOST,ADMET014,DAILY_1900_VM_UDC-CTL-PROD,COMP,COMP,COMP,OK
AIXTSM1,VHOST,AGGREGATE,DAILY_2200_VM_SDC-CTL-PROD5,COMP,COMP,COMP,OK
Code
awk -F, 'NR==FNR{ arr[$2]=$1 $2 $3 $4; next } { print $0, (arr[$2]==$1 $2 $3 $4?"OK":"NOK") }' OFS=, File2 File1
But it matches only first line .
I believe your expected output's first line is typo? Since all 4 fields of Input_file2 are NOT coming in Input_file1. Could you please try following.
awk 'BEGIN{FS=OFS=","}FNR==NR{a[$1,$2,$3,$4];next} (($1,$2,$3,$4) in a){print $0, "OK"}' Input_file2 Input_file1
Explanation: Adding explanation for above code too here.
awk -F, ' ##Mentioning field separator as comma(,) here for all lines of Input_file(s).
BEGIN{ ##Starting BEGIN section of awk program here.
FS=OFS="," ##Setting field separator and output field separator as comma(,) here.
}
FNR==NR{ ##FNR==NR condition will be when 1st Input_file named Input_file2 is being read.
a[$1,$2,$3,$4] ##Creating an array named a whose index is $1,$2,$3,$4 fields of Input_file2 lines.
next ##next will skip all further statements from here.
} ##Closing first condition block now.
(($1,$2,$3,$4) in a){ ##Checking condition if $1,$2,$3,$4 of Input_file1 are present in array a if yes then do following.
print $0,"OK" ##Printing current line with OFS and OK string here now.
}
' Input_file2 Input_file1 ##Mentioning Input_file name(s) Input_file2 and Input_file1 here.