Compare 2 files on specific row only - awk

I need to compare 2 files and find the matching rows.
The only problem is that I need to check the 4th field (out of 5) on each line of the DocumentList file and return the entire line if that value matches a line in the final file.
cat DocumentList.xml
<?xml version="1.0" encoding="UTF-8" ?> <block-list:block-list xmlns:block-list="http://openoffice.org/2001/block-list">
<block-list:block block-list:abbreviated-name="adn" block-list:name="and" />
<block-list:block block-list:abbreviated-name="tesst" block-list:name="test" />
<block-list:block block-list:abbreviated-name="tust" block-list:name="test" />
<block-list:block block-list:abbreviated-name="seme" block-list:name="same"/>
And the second file is:
cat final.txt
and
test
india
I can extract the fourth field using this command, but I do not know how to compare it with the lines from the final file:
awk -F '\"' '{print $4}' DocumentList.xml
Expected Result:
<block-list:block block-list:abbreviated-name="adn" block-list:name="and" />
<block-list:block block-list:abbreviated-name="tesst" block-list:name="test" />
<block-list:block block-list:abbreviated-name="tust" block-list:name="test" />
I have also tried something like this, but it does not return the entire line from the DocumentList file.
awk -F '\"' 'FNR==NR {a[$4]; next} $1 in a' DocumentList.xml final.txt
The final.txt file is 1 GB, DocumentList.xml is 25 MB, and both contain Unicode characters.

Just swap the order in which the two files are read:
awk -F '\"' 'FNR==NR {a[$0]; next} $4 in a' final.txt DocumentList.xml
Output:
<block-list:block block-list:abbreviated-name="adn" block-list:name="and" />
<block-list:block block-list:abbreviated-name="tesst" block-list:name="test" />
<block-list:block block-list:abbreviated-name="tust" block-list:name="test" />

With your shown samples, please try the following awk code, written and tested in GNU awk.
awk '
FNR==NR{
arr1[$0]
next
}
match($0,/block-list:name="([^"]*)"/,arr2) && (arr2[1] in arr1)
' final.txt DocumentList.xml
Explanation: The program reads both input files, final.txt and DocumentList.xml. The condition FNR==NR is true only while final.txt is being read; in that block each line becomes an index of the array arr1, and next skips the rest of the program for that line. While DocumentList.xml is read, the match function applies the regex block-list:name="([^"]*)", which matches everything from block-list:name=" up to the next "; the parenthesized group is captured into the array arr2 (three-argument match is a GNU awk feature). The condition && (arr2[1] in arr1) then checks whether that captured value exists in arr1, and if so the line is printed, effectively matching the values of final.txt against the needed value from DocumentList.xml.
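A quick standalone illustration of the three-argument match() (GNU awk only): the text captured by the parenthesized group lands in element 1 of the array you pass in:
echo 'block-list:name="test"' | gawk '{ if (match($0, /name="([^"]*)"/, m)) print m[1] }'
# prints: test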

You can try:
search=$(awk 'BEGIN{OFS=ORS=""}{if(NR>1){print "|"}print $1;}' final.txt)
# store 'and|test|india' in search variable
grep -E "block-list:name=\"($search)\"" DocumentList.xml
You get:
<block-list:block block-list:abbreviated-name="adn" block-list:name="and" />
<block-list:block block-list:abbreviated-name="tesst" block-list:name="test" />
<block-list:block block-list:abbreviated-name="tust" block-list:name="test" />
Or, using awk:
awk 'BEGIN{FS="block-list:name=\""}
FNR==NR {a[$1]; next} {f=$2;gsub(/".*/,"",f)}
FNR>1 && f in a{print $0}
' final.txt DocumentList.xml
Note: for XML files I don't recommend this approach; it is better to use an XML parser.
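For instance, a minimal sketch with xmlstarlet (assuming it is installed and that the real DocumentList.xml is well-formed XML); it pulls out every block-list:name attribute and intersects the values with final.txt, although it prints only the matching names rather than the full element lines:
xmlstarlet sel -N bl="http://openoffice.org/2001/block-list" \
  -t -m '//bl:block' -v '@bl:name' -n DocumentList.xml | grep -Fxf final.txt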


How to get all attribute names and values from HTML code using awk or sed

Here is my file; each line looks like this:
<a href="abc.com" aria-label="abc ofe" class="abc"><span class="bcd">abc</span><span class="icon"></span></a>
Here is the expected output:
href="abc.com"
aria-label="abc ofe"
class="abc"
class="bcd"
class="icon"
Here is what I got:
awk '{for(i=1; i<=NF; ++i)printf "%s%s", $i, (i<NFi?FS:(i<NF?"\n":RS))}'
echo "<span class="bcd">abc</span><span class="icon"></span>" | awk '{for(i=1; i<=NF; ++i)printf "%s%s", $i, (i<NFi?FS:(i<NF?"\n":RS))}'
gives me:
<a
href=abc.com
aria-label=abc
ofe
class=abc><span
class=bcd>abc</span><span
class=icon></span></a>
I am trying to get the attribute name (the string before the double quote)
and the attribute value (the string between the double quotes).
I need to use awk or sed on macOS.
With your shown samples, you could try the following awk code. The simple explanation: set RS (the record separator) to the patterns required in the output, then print RT, the text that actually matched; note that a regex RS and the RT variable are GNU awk features.
awk -v RS='href="[^"]*"|aria-label="[^"]*"|class="[^"]*"' 'RT{print RT}' Input_file
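A tiny illustration of RT (GNU awk only): it holds whatever text matched RS for the record just read:
printf 'a1b22c\n' | gawk -v RS='[0-9]+' 'RT{print RT}'
# 1
# 22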
With GNU awk and a regex:
awk '{$1=$1}1' FPAT='[^= ]+="[^"]+"' OFS="\n" file
Output:
href="abc.com"
aria-label="abc ofe"
class="abc"
class="bcd"
class="icon"
FPAT: A regular expression describing the contents of the fields in a record. When set, gawk parses the input into fields, where the fields
match the regular expression, instead of using the value of FS as the field separator.
See: man awk and 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
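A quick check of that behaviour on a toy input (GNU awk): with this FPAT, each field is exactly one attr="value" pair:
echo 'x="1" y="2 3"' | gawk -v FPAT='[^= ]+="[^"]+"' '{ for (i = 1; i <= NF; i++) print i, $i }'
# 1 x="1"
# 2 y="2 3"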
Try this Perl solution:
$ cat sharad.txt
<a href="abc.com" aria-label="abc ofe" class="abc"><span class="bcd">abc</span><span class="icon"></span></a>
$ perl -0777 -ne ' while(/\<([^>]+)>/sg){ $a=$1; while($a=~/(\S+?\".+?\")/mg) { print "$1\n" } } ' sharad.txt
href="abc.com"
aria-label="abc ofe"
class="abc"
class="bcd"
class="icon"
$

Why don't awk and gsub replace the dot?

This awk command:
awk -F ',' 'BEGIN {line=1} {print line "\n0" gsub(/\./, ",", $2) "0 --> 0" gsub(/\./, ",", $3) "0\n" $10 "\n"; line++}' file
is supposed to convert these lines:
Dialogue: 0,1:51:19.56,1:51:21.13,Default,,0000,0000,0000,,Hello!
into these:
1273
01:51:19,560 --> 01:51:21,130
Hello!
But somehow I'm not able to make gsub replace the . with , and instead I get 010 as both gsub results. Can anyone spot the issue?
Thanks
The return value from gsub is not the result from the substitution. It returns the number of substitutions it performed.
You want to gsub first, then print the modified string, which is the third argument you pass to gsub.
awk -F ',' 'BEGIN {line=1}
{ gsub(/\./, ",", $2);
gsub(/\./, ",", $3);
print line "\n0" $2 "0 --> 0" $3 "0\n" $10 "\n";
line++}' file
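A quick illustration of the return-value point on a toy line: gsub modifies its target in place and hands back only a count:
echo '1:51:19.56' | awk '{ n = gsub(/\./, ",", $1); print "count=" n, "field=" $1 }'
# count=1 field=1:51:19,56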
Another way is to use GNU awk's gensub instead of gsub:
$ awk -F ',' '
{
print NR ORS "0" gensub(/\./, ",","g", $2) "0 --> 0" gensub(/\./, ",","g",$3) "0" ORS $10 ORS
}' file
Output:
1
01:51:19,560 --> 01:51:21,130
Hello!
It's not as readable as the gsub solution by @tripleee, but there is a place for it.
Also, I replaced the line variable with the builtin NR and the \n's with ORS.

How to fetch a particular string using a sed command

I have an input string like below:
VAL:1|b:2|c:3|VAL:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|VAl:1234|name:mnp|VAL:91987654321
There are more than 1000 rows like this.
I want to fetch the value of the first parameter, i.e., the a field and the d field, but for the d field I want only har:919876543210#abc.com.
I tried like this:
cat $filename | grep -v Orig |sed -e 's/['a:','d:']//g' |awk -F'|' -v OFS=',' '{print $1 "," $4}' >> $NGW_DATA_FILE
The output I got is below:
1,<har919876543210#abc.com>; tag=vy6r5BpcvQ
I want it like this,
1,har:919876543210#abc.com
Where did I make the mistake and how do I solve it?
EDIT: As per OP's change of Input_file and OP's comments, adding the following now.
awk '
BEGIN{ FS="|"; OFS="," }
{
sub(/[^:]*:/,"",$1)
gsub(/^[^<]*|; .*/,"",$4)
gsub(/^<|>$/,"",$4)
print $1,$4
}' Input_file
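A quick check of this against the single sample line shown above (program collapsed onto one line):
echo 'VAL:1|b:2|c:3|VAL:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|VAl:1234|name:mnp|VAL:91987654321' | awk 'BEGIN{FS="|"; OFS=","} {sub(/[^:]*:/,"",$1); gsub(/^[^<]*|; .*/,"",$4); gsub(/^<|>$/,"",$4); print $1,$4}'
# 1,har:919876543210#abc.com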
With the shown samples, could you please try the following, written and tested with the shown samples in GNU awk.
awk '
BEGIN{
FS="|"
OFS=","
}
{
val=""
for(i=1;i<=NF;i++){
split($i,arr,":")
if(arr[1]=="a" || arr[1]=="d"){
gsub(/^[^:]*:|; .*/,"",$i)
gsub(/^<|>$/,"",$i)
val=(val?val OFS:"")$i
}
}
print val
}
' Input_file
Explanation: Adding a detailed explanation of the above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS="|" ##Setting FS as pipe here.
OFS="," ##Setting OFS as comma here.
}
{
val="" ##Nullify val here(to avoid conflicts of its value later).
for(i=1;i<=NF;i++){ ##Traversing through all fields here
split($i,arr,":") ##Splitting current field into arr with delimiter by :
if(arr[1]=="a" || arr[1]=="d"){ ##Checking condition if first element of arr is either a OR d
gsub(/^[^:]*:|; .*/,"",$i) ##Globally substituting everything from the start up to the 1st colon, OR everything from "; " onwards, with NULL in $i.
gsub(/^<|>$/,"",$i) ##Globally substituting a leading < and a trailing > with NULL in $i.
val=(val?val OFS:"")$i ##Creating variable val which has current field value and keep adding in it.
}
}
print val ##printing val here.
}
' Input_file ##Mentioning Input_file name here.
You may also try this AWK script:
cat file
VAL:1|b:2|c:3|VAL:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|VAl:1234|name:mnp|VAL:91987654321
awk -F '[|;]' '{
s=""
for (i=1; i<=NF; ++i)
if ($i ~ /^VAL:/) {
gsub(/^[^:]+:|[<>]*/, "", $i)
s = (s == "" ? "" : s "," ) $i
}
print s
}' file
1,har:919876543210#abc.com
You can do the same thing with sed rather easily using Extended Regex, two capture groups and two back-references, e.g.
sed -E 's/^[^:]*:(\w+)[^<]*[<]([^>]+).*$/\1,\2/'
Explanation
's/find/replace/' standard substitution, where the find is;
^[^:]*: from the beginning skip through the first ':', then
(\w+) capture one or more word characters ([a-zA-Z0-9_]), then
[^<]*[<] consume zero or more characters not a '<', then the '<', then
([^>]+) capture everything not a '>', and
.*$ discard all remaining chars in line, then the replace is
\1,\2 reinsert the captured groups separated by a comma.
Example Use/Output
$ echo 'a:1|b:2|c:3|d:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|' |
sed -E 's/^[^:]*:(\w+)[^<]*[<]([^>]+).*$/\1,\2/'
1,har:919876543210#abc.com

How to not remove the header while executing awk

I have a file file like this:
k_1_1
k_1_3
k_1_6
...
I have a file file2:
0,1,2,3,...
k_1_1,17,16,15,...
k_1_2,17,89,15,...
k_1_3,10,26,45,...
k_1_4,17,16,15,...
k_1_5,10,26,45,...
k_1_6,17,16,15,...
...
I want to print the lines of file2 that match file. The desired output is:
0,1,2,3,...
k_1_1,17,16,15,...
k_1_3,10,26,45,...
k_1_6,17,16,15,...
I tried
awk 'BEGIN{FS=OFS=","}NR==FNR{a[$1];next}$1 in a {print $0}' file file2 > result
But the header line is gone in result like this :
k_1_1,17,16,15,...
k_1_3,10,26,45,...
k_1_6,17,16,15,...
How can I maintain it? Thank you.
Always print the first line, unconditionally.
awk 'BEGIN{FS=OFS=","}
NR==FNR{a[$1];next}
FNR==1 || $1 in a' file file2 > result
Notice also how { print $0 } is not necessary because it's the default action.
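For instance, these two behave identically; a pattern with no action block prints the whole record by default:
printf 'a\nb\n' | awk 'NR==1'
printf 'a\nb\n' | awk 'NR==1 {print $0}'
# both print: a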
A very ad-hoc solution to your problem could be to compose the output in a command group:
{ head -1 file2; awk 'BEGIN{FS=OFS=","}NR==FNR{a[$1];next}$1 in a {print $0}' file file2; } > result
Could you please try the following.
awk -F, 'FNR==NR{a[$1]=$0;next} FNR==1 && ++count==1{print;next} a[$1]' Input_file Input_file2
OR
awk -F, 'FNR==NR{a[$1]=$0;next} FNR==1{print;next} a[$1]' Input_file Input_file2

Edit header file with awk

I have a whitespace-separated-value file, and I need to convert it so that:
header = tab separated,
records = " ; " separated (space-semicolon-space)
What I'm doing now is:
cat ${original} | awk 'END {FS=" "} { for(i=1; i<=NR; i++) {if (i==1) { OFS="\t"; print $0; } else { OFS=";" ;print $0; }}}' > ${new}
But it is only partly working. First, it produces millions of lines, while the original file has about 90000.
Second, the header, which should be modified here:
if (i==1) { OFS="\t"; print $0; }
is not modified at all.
Another option would be to use sed; I can get the job partially done, but again the header remains untouched:
cat ${original} | sed 's/\t/ ;/g' > ${new}
This line should change all the separators in the file:
awk -F'\t' -v OFS=";" '$1=$1' file
This will leave the header untouched:
awk -F'\t' -v OFS=";" 'NR>1{$1=$1}1' file
This will only change the header line:
awk -F'\t' -v OFS=";" 'NR==1{$1=$1}1' file
You could paste an example to let us know why your header was not modified.
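As a rough sketch of the original requirement (assuming the input really is whitespace-separated, and reusing the ${original}/${new} variables from the question): pick the OFS per line, tab for the header and " ; " for every other record, and force awk to rebuild the record with $1=$1:
awk 'NR==1{OFS="\t"} NR>1{OFS=" ; "} {$1=$1; print}' "${original}" > "${new}"
The $1=$1 assignment only rebuilds lines that have at least one field, so blank lines pass through unchanged.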