Goal: To print lines of File2 when field 1 ($1) and field 4 ($4) of File1 both match a substring in field 4 ($4) on lines beginning with ">" in File2.
Important note #1: The lines being printed to output include the line being searched and all the lines following it until the next line with a ">".
Example: When fields 1 and 4 of File1 are 2776 and 2968 respectively, they should be searched against field 4 of File2 to eventually find the match 2776-2968(+) (because both numbers from File1 match a substring in field 4 of File2). The order of the numbers in the string does not matter: 2968-2776(+) should also be considered a match. Since they match, that line of File2 is printed along with all lines below it until another line starting with ">" is encountered.
Important note #2: File1 is tab-delimited (\t). File2 is colon-delimited (:).
File1:
Transcription_Start Translation_Start Translation_Stop Transcription_Stop Strand Expression
2776 2968 + 920
17374 17563 + 1959
2968 2786 - 802
17563 17375 - 1694
19606 19395 - 1914
File2:
>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC
>antisense_CDR20291_r27::NC_013316.1:10830-11707(-)
TATTTCTTGTTCCTTTTTTCAAGGACAAGTAAATAAATTAACCTACTGTTTAATTTTCAA
>antisense_CDR20291_r27::NC_013316.1:11814-11874(-)
TTCCTTTGAGTTTCACTCTTGCGAGCGTACTTCCCAGGCGGA
Desired Output:
>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC
This is what I've tried so far (it outputs the full contents of File2, thus failing to produce the desired output):
$ awk -F"\t|:" 'NR==FNR{a[$4]; next} ($1 in a) || ($4 in a)' File1 File2 > Output
>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC
>antisense_CDR20291_r27::NC_013316.1:10830-11707(-)
TATTTCTTGTTCCTTTTTTCAAGGACAAGTAAATAAATTAACCTACTGTTTAATTTTCAA
>antisense_CDR20291_r27::NC_013316.1:11814-11874(-)
TTCCTTTGAGTTTCACTCTTGCGAGCGTACTTCCCAGGCGGA
How can I process my files with awk (or similar) to achieve my goal?
With the samples shown, please try the following. Written and tested with GNU awk.
awk '
FNR==NR{
arr[$1,$2]
next
}
/^>/{
found=""
if((($5,$6) in arr) || (($6,$5) in arr)){
found=1
}
}
found
' file1 FS=":|-|\\\\(" file2
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when file1 is being read.
arr[$1,$2] ##Creating arr with index of 1st and 2nd field.
next ##next will skip all further statements from here.
}
/^>/{ ##Checking condition if line starts from > then do following.
found="" ##Nullifying found here.
if((($5,$6) in arr) || (($6,$5) in arr)){ ##Checking condition if either 5th 6th field is present in arr OR 6th 5th field as a key present in arr then do following.
found=1 ##Setting found to 1 here.
}
}
found ##Checking condition if found is set then print that line.
' file1 FS=":|-|\\\\(" file2 ##Mentioning Input_file(s) and setting field separator before Input_file2 to get exact values to match.
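The answer above depends on the coordinates landing in $5 and $6, which shifts if the header name itself contains a "-" or ":". A minimal sketch of a variant that only looks at the last colon-separated field of each header is shown below. The sample files are shortened and hypothetical, and it assumes File1 really is tab-delimited with the two numbers in fields 1 and 4, as the question describes.

```shell
# Hypothetical, shortened sample files. The File1 layout (tab-delimited,
# empty columns 2 and 3) is an assumption based on the question's description.
dir=$(mktemp -d)
printf '2776\t\t\t2968\t+\t920\n2968\t\t\t2786\t-\t802\n' > "$dir/File1"
cat > "$dir/File2" <<'EOF'
>-::NC_013316.1:2776-2968(+)
ATTGAACGC
TCTTGAGAG
>antisense_CDR20291_r27::NC_013316.1:10830-11707(-)
TATTTCTTG
>-::NC_013316.1:2786-2968(-)
GTTCCTCCT
EOF

out=$(awk '
FNR==NR { pairs[$1, $4]; next }      # File1: remember each (field 1, field 4) pair
/^>/ {
    split($NF, c, /[-(]/)            # last ":" field, e.g. 2776-2968(+) -> c[1], c[2]
    keep = ((c[1], c[2]) in pairs) || ((c[2], c[1]) in pairs)
}
keep                                 # print the header and its sequence lines
' FS='\t' "$dir/File1" FS=':' "$dir/File2")
printf '%s\n' "$out"
rm -rf "$dir"
```

The antisense record is skipped, and the 2786-2968(-) record is kept because the pair is looked up in both orders.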
table1.csv:
33622|AAA
33623|AAA
33624|BBB
33625|CCC
33626|DDD
33627|AAA
33628|BBB
33629|EEE
33630|FFF
Aims:
33622|AAA
33623|AAA
33624|BBB
33625|CCC
33626|DDD
33627|AAA
33628|BBB
Using command:
awk 'BEGIN{FS="|";OFS="|"} {if($2=="AAA" || $2=="BBB" || $2=="CCC" || $2=="DDD"){print $1,$2}}' table1.csv
However, I would like this to be more automatic, since the number of categories may grow.
list1.csv:
AAA BBB CCC DDD
list=`cat list1.csv`
awk -v list=$list 'BEGIN{FS="|";OFS="|"} {if($2==list){print $1,$2}}' table1.csv
In other words, can I store the whole set of allowed $2 values (AAA, BBB, ...) in a variable built from list1.csv?
Expected output:
33622|AAA
33623|AAA
33624|BBB
33625|CCC
33626|DDD
33627|AAA
33628|BBB
So, any suggestions on storing multiple conditions in one variable?
Thanks all!
$ awk 'NR==FNR{for(i=1;i<=NF;i++)a[$i];next}FNR==1{FS="|";$0=$0}($2 in a)' list table
Output:
33622|AAA
33623|AAA
33624|BBB
33625|CCC
33626|DDD
33627|AAA
33628|BBB
Explained:
$ awk '
NR==FNR { # process list
for(i=1;i<=NF;i++) # hash all items in file
a[$i]
next # possibility for multiple lines
}
FNR==1 { # changing FS in the beginning of table file
FS="|"
$0=$0
}
($2 in a)' list table
Almost the same logic as James Brown's nice answer; this small variant sets the field separator on the command line, just before the file it applies to.
awk 'FNR==NR{for(i=1;i<=NF;i++){arr[$i]};next} ($2 in arr)' list FS="|" table
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when list is being read.
for(i=1;i<=NF;i++){ ##Going through all fields here.
arr[$i] ##Creating arr with index of current column value here.
}
next ##next will skip all further statements from here.
}
($2 in arr) ##Checking condition if 2nd field is present in arr then print that line from table file.
' list FS="|" table ##mentioning Input_file(s) here and setting FS as | before table file.
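If the list really has to travel in a shell variable, as in the question's attempt, one possible approach is to pass it with a quoted -v and split it into an array in BEGIN. This is only a sketch: the array names items and allowed are made up for illustration, the table is a shortened sample, and it assumes the list fits on one line.

```shell
# Shortened sample data taken from the question.
dir=$(mktemp -d)
printf 'AAA BBB CCC DDD\n' > "$dir/list1.csv"
printf '%s\n' '33622|AAA' '33624|BBB' '33629|EEE' '33626|DDD' '33630|FFF' > "$dir/table1.csv"

list=$(cat "$dir/list1.csv")
out=$(awk -v list="$list" '
BEGIN {
    FS = OFS = "|"
    n = split(list, items, " ")          # " " splits on runs of whitespace
    for (i = 1; i <= n; i++) allowed[items[i]]
}
$2 in allowed { print $1, $2 }           # keep rows whose category is listed
' "$dir/table1.csv")
printf '%s\n' "$out"
rm -rf "$dir"
```

Quoting "$list" matters here: unquoted, the shell would split it and awk would receive only the first word as the variable's value.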
I am comparing columns between two files for exact match but I am ending up with inaccurate result. Example as follows.
File1 File2
adam sunny
jhon adam
kelly adam
matt kevin
stuart adam
Gary Gary
When we look at the files there is only one match, i.e. Gary. My output should be the following.
Emptyline
Emptyline
Emptyline
Emptyline
Emptyline
Gary
To achieve this, I am running the following command:
awk 'NR==FNR { n[$1]=$0;next } ($1 in n) { print n[$1],$2 }' file1 file2
and I am getting output as follows
adam
adam
adam
Gary
You should be tracking line numbers, not just line contents:
$ awk 'NR==FNR { lines[NR]=$0; next }
{ if ($0 == lines[FNR]) print; else print "" }' file1.txt file2.txt
Gary
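To make the difference concrete, here is a small sketch that runs both lookups on the question's sample data. The variable names by_value and by_line are just for illustration.

```shell
dir=$(mktemp -d)
printf '%s\n' adam jhon kelly matt stuart Gary > "$dir/file1"
printf '%s\n' sunny adam adam kevin adam Gary > "$dir/file2"

# Keyed by content: any file2 value that occurs anywhere in file1 matches.
by_value=$(awk 'NR==FNR { seen[$1]; next } $1 in seen' "$dir/file1" "$dir/file2")

# Keyed by line number: a value matches only if file1 has it on the same line.
by_line=$(awk 'NR==FNR { line[FNR] = $0; next }
               { print ($0 == line[FNR] ? $0 : "") }' "$dir/file1" "$dir/file2")

printf 'by value:\n%s\n' "$by_value"
printf 'by line:\n%s\n' "$by_line"
rm -rf "$dir"
```

The content-keyed version reports adam three times plus Gary, while the line-keyed version prints five empty lines followed by Gary.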
1st solution: With simple awk.
awk 'FNR==NR{a[FNR]=$0;next} a[FNR]==$0{print;next} {print ""}' file1 file2
OR as per anubhava sir's comment:
awk 'FNR==NR{a[FNR]=$0;next} a[FNR]!=$0{$0=""} 1' file1 file2
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first file Input_file1 is being read.
a[FNR]=$0 ##Creating an array a with index FNR and value of current line here.
next ##next will skip all further statements from here.
}
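a[FNR]==$0{ ##Checking condition if value of array a with FNR index and current line is equal then do following.
print ##Printing current line here, since it matches the same-numbered line of file1.
next ##next will skip all further statements from here.
}
{
print "" ##Printing an empty line when the two lines differ.
}
' file1 file2 ##Mentioning Input_file names here.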
2nd solution: Considering that your actual Input_file(s) have only 2 columns as per shown samples, could you please try following then.
paste Input_file1 Input_file2 | awk '$1==$2{print $1};$1!=$2{print ""}'
This prints the value on lines where Input_file1 and Input_file2 are equal, and an empty line otherwise.
Add a new first column holding the number of lines on which columns 1 and 2 contain exactly the same values.
input file
46849,39785,2,012,023,351912.29,2527104.70,174.31
46849,39785,2,012,028,351912.45,2527118.70,174.30
46849,39785,3,06,018,351912.12,2527119.51,174.33
46849,39785,3,06,020,351911.80,2527105.83,174.40
46849,39797,2,012,023,352062.45,2527118.50,173.99
46849,39797,2,012,028,352062.51,2527105.51,174.04
46849,39797,3,06,020,352063.29,2527116.71,174.13,
46849,39809,2,012,023,352211.63,2527104.81,173.74
46849,39809,2,012,028,352211.21,2527117.94,173.69
46849,39803,2,012,023,352211.63,2527104.81,173.74
46849,39803,2,012,028,352211.21,2527117.94,173.69
46849,39801,2,012,023,352211.63,2527104.81,173.74
Expected output file:
4,46849,39785,2,012,023,351912.29,2527104.70,174.31
4,46849,39785,2,012,028,351912.45,2527118.70,174.30
4,46849,39785,3,06,018,351912.12,2527119.51,174.33
4,46849,39785,3,06,020,351911.80,2527105.83,174.40
3,46849,39797,2,012,023,352062.45,2527118.50,173.99
3,46849,39797,2,012,028,352062.51,2527105.51,174.04
3,46849,39797,3,06,020,352063.29,2527116.71,174.13,
2,46849,39809,2,012,023,352211.63,2527104.81,173.74
2,46849,39809,2,012,028,352211.21,2527117.94,173.69
2,46849,39803,2,012,023,352211.63,2527104.81,173.74
2,46849,39803,2,012,028,352211.21,2527117.94,173.69
1,46849,39801,2,012,023,352211.63,2527104.81,173.74
attempt:
awk -F, '{x[$1 $2]++}END{ for(i in x) {print i,x[i]}}' file
4684939785 4
4684939797 3
4684939801 1
4684939803 2
4684939809 2
Could you please try following.
awk '
BEGIN{
FS=OFS=","
}
FNR==NR{
a[$1,$2]++
next
}
{
print a[$1,$2],$0
}
' Input_file Input_file
Explanation: Input_file is read twice. On the first pass, an array named a is built, indexed by the first and second fields, counting how often each pair occurs. On the second pass, the count for the current line's first two fields is printed, followed by the whole line.
One liner code:
awk 'BEGIN{FS=OFS=","} FNR==NR{a[$1,$2]++;next} {print a[$1,$2],$0}' Input_file Input_file
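When the data arrives on a pipe and cannot be read twice, a single-pass variant that buffers the lines in memory is one option. The sketch below uses shortened sample rows and made-up array names (key, line, count), and it assumes the input is small enough to hold in memory.

```shell
out=$(printf '%s\n' \
    '46849,39785,2,012' \
    '46849,39785,3,06' \
    '46849,39797,2,012' |
awk 'BEGIN { FS = OFS = "," }
{
    key[NR]  = $1 FS $2        # remember the key of each line
    line[NR] = $0              # and the line itself, in input order
    count[key[NR]]++
}
END {
    for (i = 1; i <= NR; i++)  # replay the lines with their final counts
        print count[key[i]], line[i]
}')
printf '%s\n' "$out"
```

The two-pass version above is the better choice for large files, since it stores only the counts.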
Hi Everyone I have below data.
61684 376 23 106 38695633 1 0 0 -1 /C/Program Files (x86)/ 16704 root;TrustedInstaller#NT:SERVICE root;TrustedInstaller#NT:SERVICE 0 1407331175 1407331175 1247541608
8634 416 13 86 574126 1 0 0 -1 /E/KYCImages/ 16832 root;kycfinal#CGKYCAPP03 root;None#CGKYCAPP03 0 1406018846 1406018846 1352415392
60971 472 22 86 38613076 1 0 0 -1 /E/KYCwebsvc binaries/ 16832 root;kycfinal#CGKYCAPP03 root;None#CGKYCAPP03 0 1390829495 1390829495 1353370744
1 416 10 86 1 1 0 0 -1 /E/KycApp/ 16832 root;kycfinal#CGKYCAPP03 root;None#CGKYCAPP03 0 1411465772 1411465772 1351291187
Now I am using below code:
awk 'BEGIN{FPAT = "([^ ]+)|(\"[^\"]+\")"}{print $10}' filename | awk '$1!~/^\/\./' | sort -u | sed -e 's/\,//g' | perl -p00e 's/\n(?!\Z)/;/g'
I am getting this output
/C/Program;/E/KycApp/;/E/KYCImages/;/E/KycServices/;/E/KYCwebsvc
However, I need each value to start at $10 and continue until the closing "/" is encountered; basically, I want spaces inside the path in column 10 to be ignored as separators.
Is it possible?
Desired output is
/C/Program Files (x86)/;/E/KycApp/;/E/KYCImages/;/E/KycServices/;/E/KYCwebsvc binaries/
With single gawk:
awk 'BEGIN{ FPAT="/[^/]+/[^/]+/"; PROCINFO["sorted_in"]="#ind_str_asc"; IGNORECASE = 1 }
{ a[$1] }END{ for(i in a) r=(r!="")? r";"i : i; print r }' filename
The output (without /E/KycServices/; because it is not in your input):
/C/Program Files (x86)/;/E/KycApp/;/E/KYCImages/;/E/KYCwebsvc binaries/
You could also try the following single awk command.
awk '{match($0,/\/.*\//);VAL=VAL?VAL ORS substr($0,RSTART,RLENGTH):substr($0,RSTART,RLENGTH)} END{num=split(VAL, array,"\n");for(i=1;i<=num;i++){printf("%s%s",array[i],i==num?"":";")};print""}' Input_file
EDIT1: Adding the non-one-liner form of the solution.
awk '{
match($0,/\/.*\//);
VAL=VAL?VAL ORS substr($0,RSTART,RLENGTH):substr($0,RSTART,RLENGTH)
}
END{
num=split(VAL, array,"\n");
for(i=1;i<=num;i++){
printf("%s%s",array[i],i==num?"":";")
};
print""
}
' Input_file
EDIT2: Adding an explanation of the code in its non-one-liner form.
awk '{
match($0,/\/.*\//); ##Using awk's match function with a regex that matches from the first / to the last / in the line; the slashes are escaped inside the regex.
VAL=VAL?VAL ORS substr($0,RSTART,RLENGTH):substr($0,RSTART,RLENGTH) ##Creating a variable named VAL which appends each matched substring to its previous value, separated by ORS. RSTART and RLENGTH are built-in awk variables which match sets to the position and length of the matched text.
}
END{ ##Starting this block here.
num=split(VAL, array,"\n");##Creating a variable num holding the number of elements in array; split is a built-in awk function which fills an array using the given delimiter, here a newline.
for(i=1;i<=num;i++){ ##Starting a for loop which runs i from 1 to num.
printf("%s%s",array[i],i==num?"":";") ##Printing the array element at index i followed by a semicolon, except after the last element (i equal to num), where nothing is appended.
};
print"" ##Printing an empty string to end the output with a newline.
}
' Input_file ###Mentioning the Input_file here.
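Unlike the asker's pipeline, the match-based solution keeps duplicates and input order. A possible sketch that also de-duplicates and sorts, using only POSIX tools, is shown below. The input rows are shortened, hypothetical versions of the question's data, and the seen array is a name chosen for illustration.

```shell
out=$(printf '%s\n' \
    '61684 376 -1 /C/Program Files (x86)/ 16704 root' \
    '8634 416 -1 /E/KYCImages/ 16832 root' \
    '1 416 -1 /E/KYCImages/ 16832 root' |
awk 'match($0, /\/.*\//) {                 # from the first "/" to the last "/"
         path = substr($0, RSTART, RLENGTH)
         if (!seen[path]++) print path     # print each path only once
     }' |
LC_ALL=C sort | paste -sd ';' -)
printf '%s\n' "$out"
```

Sorting is left to sort and joining to paste, so the awk part stays portable and does not need gawk's PROCINFO["sorted_in"].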