I have a log file like this:
some text line
other text line
<a>
<b>1</b>
<c>2</c>
</a>
another text line
<a>
<b>1</b>
<c>2</c>
</a>
yet another text line
I need to get only the first occurrence of the XML "a":
<a>
<b>1</b>
<c>2</c>
</a>
I know
awk '/<a>/,/<\/a>/' file.log
will find all occurrences. How can I get just the first? (Adding | head -n1 obviously doesn't work because it captures only the first line, and I can't know how long "a" is: the awk expression must be generic because I have different log files with different "a" contents.)
Another slight variation is to use a simple flag variable that indicates when you are inside the first <a>...</a> block, print that block, and exit afterwards. Here n is the flag, e.g.
awk -v n=0 '$1=="</a>" {print $1; exit} $1=="<a>" {n=1}; n==1' f.xml
Example Use/Output
With your input file as f.xml you would get:
$ awk -v n=0 '$1=="</a>" {print $1; exit} $1=="<a>" {n=1}; n==1' f.xml
<a>
<b>1</b>
<c>2</c>
</a>
(note: the {n=1} and n==1 rules rely on the default operation (print) to output the record)
This awk:
awk '
match($0,/<a>/) {
$0=substr($0,RSTART)
flag=1
}
match($0,/<\/a>/) {
$0=substr($0,1,RSTART+RLENGTH-1)
print
exit
}
flag' file
can handle these forms:
<a><b>1</b><c>2</c></a>
and this:
<a>
<b>1</b>
<c>2</c>
</a>
and also <a>
<b>1</b>
<c>2</c>
</a> this
the end
Another for GNU awk:
$ gawk -v RS="</?a>" '
NR==1 { printf "%s", RT }
NR==2 { print $0 RT; exit }
' file
First:
$ awk '/<a>/{f=1} f; /<\/a>/{exit}' file
<a>
<b>1</b>
<c>2</c>
</a>
Last:
$ tac file | awk '/<\/a>/{f=1} f; /<a>/{exit}' | tac
<a>
<b>1</b>
<c>2</c>
</a>
Nth:
$ awk -v n=2 '/<a>/{c++} c==n{print; if (/<\/a>/) exit}' file
<a>
<b>1</b>
<c>2</c>
</a>
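Since the question asks for something generic across log files, the "First" one-liner above can be wrapped in a small shell function that takes the tag name as a parameter. This is only a sketch: first_block is a hypothetical helper name, the sample log is made up from the question, and it assumes the opening tag carries no attributes (it looks for the literal string <TAG>).

```shell
#!/bin/sh
# first_block TAG FILE  (hypothetical helper, not taken from the answers)
# Prints the first <TAG>...</TAG> block found in FILE, line by line.
first_block() {
    awk -v tag="$1" '
    BEGIN { start_tag = "<" tag ">"; end_tag = "</" tag ">" }
    index($0, start_tag)         { inside = 1 }   # opening tag seen on this line
    inside                       { print }        # print every line while inside
    inside && index($0, end_tag) { exit }         # stop after the first closing tag
    ' "$2"
}

# Sample input modelled on the log in the question (assumption).
log=$(mktemp)
cat > "$log" <<'EOF'
some text line
<a>
<b>1</b>
<c>2</c>
</a>
another text line
<a>
<b>1</b>
<c>2</c>
</a>
EOF

block=$(first_block a "$log")
printf '%s\n' "$block"
rm -f "$log"
```

index() is used instead of a regex so the tag name does not need escaping.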
Related
I need to compare 2 files and find the matching rows.
The only problem is that I need to check the 4th field out of 5 (splitting each line on double quotes) in the DocumentList file and return the entire line if a match is found in the final file.
cat DocumentList.xml
<?xml version="1.0" encoding="UTF-8" ?> <block-list:block-list xmlns:block-list="http://openoffice.org/2001/block-list">
<block-list:block block-list:abbreviated-name="adn" block-list:name="and" />
<block-list:block block-list:abbreviated-name="tesst" block-list:name="test" />
<block-list:block block-list:abbreviated-name="tust" block-list:name="test" />
<block-list:block block-list:abbreviated-name="seme" block-list:name="same"/>
And the second file is:
cat final.txt
and
test
india
I can extract the fourth field using this command, but do not know how to compare it with the lines from the final file:
awk -F '\"' '{print $4}' DocumentList.xml
Expected Result:
<block-list:block block-list:abbreviated-name="adn" block-list:name="and" />
<block-list:block block-list:abbreviated-name="tesst" block-list:name="test" />
<block-list:block block-list:abbreviated-name="tust" block-list:name="test" />
I have also tried something like this, but it does not return the entire line from the DocumentList file:
awk -F '\"' 'FNR==NR {a[$4]; next} $1 in a' DocumentList.xml final.txt
final.txt file is 1 GB, DocumentList is 25 MB and both have unicode characters.
Just swap the order of reading files:
awk -F '\"' 'FNR==NR {a[$0]; next} $4 in a' final.txt DocumentList.xml
Output:
<block-list:block block-list:abbreviated-name="adn" block-list:name="and" />
<block-list:block block-list:abbreviated-name="tesst" block-list:name="test" />
<block-list:block block-list:abbreviated-name="tust" block-list:name="test" />
With your shown samples, please try following awk code. Written and tested in GNU awk.
awk '
FNR==NR{
arr1[$0]
next
}
match($0,/block-list:name="([^"]*)"/,arr2) && (arr2[1] in arr1)
' final.txt DocumentList.xml
Explanation: awk reads final.txt first, then DocumentList.xml. The condition FNR==NR is true only while the first file (final.txt) is being read; in that block each line becomes an index of the array arr1, and next skips the remaining rules. For DocumentList.xml, match() applies the regex block-list:name="([^"]*)". The parenthesised group captures the text between the quotes, and GNU awk stores it in arr2[1]. The condition && (arr2[1] in arr1) then checks whether that captured value was seen in final.txt; if so, the default action prints the whole line.
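The three-argument match() used above is a GNU awk extension. For other awks, the same lookup can be sketched with the two-argument match() plus RSTART/RLENGTH. The sample files below are a shortened copy of the ones in the question, and the variable names (words, value) are only illustrative.

```shell
#!/bin/sh
# Sketch: same lookup without gawk's three-argument match().
tmpdir=$(mktemp -d)
cat > "$tmpdir/final.txt" <<'EOF'
and
test
india
EOF
cat > "$tmpdir/DocumentList.xml" <<'EOF'
<block-list:block block-list:abbreviated-name="adn" block-list:name="and" />
<block-list:block block-list:abbreviated-name="tesst" block-list:name="test" />
<block-list:block block-list:abbreviated-name="seme" block-list:name="same"/>
EOF

matches=$(awk '
FNR == NR { words[$0]; next }            # first file: remember each word
match($0, /block-list:name="[^"]*"/) {
    # skip the 17-character prefix block-list:name=" and drop the closing quote
    value = substr($0, RSTART + 17, RLENGTH - 18)
    if (value in words) print
}
' "$tmpdir/final.txt" "$tmpdir/DocumentList.xml")

printf '%s\n' "$matches"
rm -rf "$tmpdir"
```

As with the original, the FNR==NR idiom assumes final.txt is not empty.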
you can try,
search=$(awk 'BEGIN{OFS=ORS=""}{if(NR>1){print "|"}print $1;}' final.txt)
# store 'and|test|india' in search variable
grep -E "block-list:name=\"($search)\"" DocumentList.xml
you get,
<block-list:block block-list:abbreviated-name="adn" block-list:name="and" />
<block-list:block block-list:abbreviated-name="tesst" block-list:name="test" />
<block-list:block block-list:abbreviated-name="tust" block-list:name="test" />
Or using, 'awk'
awk 'BEGIN{FS="block-list:name=\""}
FNR==NR {a[$1]; next} {f=$2;gsub(/".*/,"",f)}
FNR>1 && f in a{print $0}
' final.txt DocumentList.xml
Note: for XML files I don't recommend this approach; it is better to use an XML parser.
Here is my file that has each line as mentioned below:
<a href="abc.com" aria-label="abc ofe" class="abc"><span class="bcd">abc</span><span class="icon"></span></a>
here is expected output:
href="abc.com"
aria-label="abc ofe"
class="abc"
class="bcd"
class="icon"
Here is what I got:
awk '{for(i=1; i<=NF; ++i)printf "%s%s", $i, (i<NFi?FS:(i<NF?"\n":RS))}'
echo "<a href="abc.com" aria-label="abc ofe" class="abc"><span class="bcd">abc</span><span class="icon"></span></a>" | awk '{for(i=1; i<=NF; ++i)printf "%s%s", $i, (i<NFi?FS:(i<NF?"\n":RS))}'
gives me:
<a
href=abc.com
aria-label=abc
ofe
class=abc><span
class=bcd>abc</span><span
class=icon></span></a>
Trying to get "attribute" string before double quote
and "attribute value" string between double quote.
Need to use awk or sed for macOS.
With your shown samples, you could try the following awk code (GNU awk, since it relies on RT). It sets RS (the record separator) to a regex matching the attribute forms required in the output, then prints each matched separator via RT.
awk -v RS='href="[^"]*"|aria-label="[^"]*"|class="[^"]*"' 'RT{print RT}' Input_file
With GNU awk and a regex:
awk '{$1=$1}1' FPAT='[^= ]+="[^"]+"' OFS="\n" file
Output:
href="abc.com"
aria-label="abc ofe"
class="abc"
class="bcd"
class="icon"
FPAT: A regular expression describing the contents of the fields in a record. When set, gawk parses the input into fields, where the fields
match the regular expression, instead of using the value of FS as the field separator.
See: man awk and 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
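RT and FPAT are GNU awk features, and the question asks for something that runs on macOS, whose stock awk is not gawk. A portable alternative can be sketched with a match() loop. The sample line is an assumption reconstructed from the expected output, and the regex only covers double-quoted values.

```shell
#!/bin/sh
# Sketch: print every name="value" pair using only POSIX awk features.
# The input line is an assumed sample, rebuilt from the expected output.
line='<a href="abc.com" aria-label="abc ofe" class="abc"><span class="bcd">abc</span><span class="icon"></span></a>'

attrs=$(printf '%s\n' "$line" | awk '
{
    rest = $0
    # find the next name="value" pair, print it, then continue after it
    while (match(rest, /[^= <>"]+="[^"]*"/)) {
        print substr(rest, RSTART, RLENGTH)
        rest = substr(rest, RSTART + RLENGTH)
    }
}')

printf '%s\n' "$attrs"
```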
Try this Perl solution:
$ cat sharad.txt
<a href="abc.com" aria-label="abc ofe" class="abc"><span class="bcd">abc</span><span class="icon"></span></a>
$ perl -0777 -ne ' while(/\<([^>]+)>/sg){ $a=$1; while($a=~/(\S+?\".+?\")/mg) { print "$1\n" } } ' sharad.txt
href="abc.com"
aria-label="abc ofe"
class="abc"
class="bcd"
class="icon"
$
This is basically an awk question but it is about processing data for the Moodle
Gift format, thus the tags.
I want to format html code in a question (Moodle "test" activity) but I need to replace < and > with the corresponding entities, as these will be interpreted as "real" html, and not printed.
However, I want to be able to type the question with regular code and post-process the file before importing it as gift into Moodle.
I thought awk would be the perfect tool to do this.
Say I have this (invalid as such) Moodle/gift question:
::q1::[html]This is a question about HTML:
<pre>
<p>some text</p>
</pre>
and some tag:<code><img></code>
{T}
What I want is a script that translates this into a valid gift question:
::q1::[html]This is a question about HTML:
<pre>
&lt;p&gt;some text&lt;/p&gt;
</pre>
and some tag:<code>&lt;img&gt;</code>
{T}
Key point: replace < and > with &lt; and &gt; when:
inside a <pre>...</pre> block (assuming those tags are alone on a line)
between <code> and </code>, with an arbitrary string in between.
For the first part, I'm fine. I have a shell script calling awk (gawk, actually).
awk -f process_src2gift.awk $1.src >$1.gift
with process_src2gift.awk:
BEGIN { print "// THIS IS A GENERATED FILE !" }
{
if( $1=="<pre>" ) # opening a "code" block
{
code=1;
print $0;
}
else
{
if( $1=="</pre>" ) # closing a "code" block
{
code=0;
print $0;
}
else
{ # if "code block", replace < > by html entities
if( code==1 )
{
gsub(">","\\&gt;");
gsub("<","\\&lt;");
}
print $0;
}
}
}
END { print "// END" }
However, I'm stuck with the second requirement..
Questions:
Is it possible to add code to my awk script to process the HTML inside the <code> tags? Any ideas? I thought about using sed but I didn't see how to do that.
Maybe awk isn't the right tool for that? I'm open to suggestions for other (standard Linux) tools.
Answering own question.
I found a solution by doing a two step awk process:
first step as described in question
second step: define <code> or </code> as the field delimiter, using a regex, and apply the string replacement to the second field ($2).
The shell file becomes:
echo "Step 1"
awk -f process_src2gift.awk $1.src >$1.tmp
echo "Step 2"
awk -f process_src2gift_2.awk $1.tmp >$1.gift
rm $1.tmp
And the second awk file (process_src2gift_2.awk) will be:
BEGIN { FS="[<][/]?[c][o][d][e][>]"; }
{
gsub(">","\\&gt;",$2);
gsub("<","\\&lt;",$2);
if( NF >= 3 )
print $1 "<code>" $2 "</code>" $3
else
print $0
}
Of course, there are limitations:
no attributes in the <code> tag
only one pair <code></code> in the line
probably others...
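The "only one pair per line" limitation could be lifted by walking each line with match() instead of splitting on FS, which also lets both steps run in one pass. What follows is only a sketch under the same assumptions as above (no attributes on <code>, <pre> tags alone on their line); esc() is a hypothetical helper name and the sample question text is made up.

```shell
#!/bin/sh
# Sketch: one awk pass that escapes < and > inside <pre> blocks and inside
# every <code>...</code> pair on a line.
src='::q1::[html]Tags:
<pre>
<p>some text</p>
</pre>
<code><img></code> and <code><br></code>
{T}'

gift=$(printf '%s\n' "$src" | awk '
function esc(s) {                       # hypothetical helper: escape < and >
    gsub(/>/, "\\&gt;", s)
    gsub(/</, "\\&lt;", s)
    return s
}
$1 == "<pre>"  { code = 1; print; next }
$1 == "</pre>" { code = 0; print; next }
code           { print esc($0); next }  # inside a <pre> block
{
    out = ""
    rest = $0
    while (match(rest, /<code>/)) {
        out = out substr(rest, 1, RSTART + RLENGTH - 1)  # text up to and including <code>
        rest = substr(rest, RSTART + RLENGTH)
        if (!match(rest, /<\/code>/)) break              # unclosed tag: leave the tail as is
        out = out esc(substr(rest, 1, RSTART - 1)) "</code>"
        rest = substr(rest, RSTART + RLENGTH)
    }
    print out rest
}')

printf '%s\n' "$gift"
```

The "\\&gt;" form is needed because a bare & in a gsub() replacement stands for the matched text.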
I have a whitespace-separated file that I need to convert into:
header=tab separated,
records=" ; " separated (space-semicolon-space)
What I'm doing now is:
cat ${original} | awk 'END {FS=" "} { for(i=1; i<=NR; i++) {if (i==1) { OFS="\t"; print $0; } else { OFS=";" ;print $0; }}}' > ${new}
But it only partly works. First, it produces millions of lines, while the original has about 90000.
Second, the header, which should be modified here:
if (i==1) { OFS="\t"; print $0; }
is not modified at all.
Another option would be sed; it gets the job partially done, but again the header remains untouched:
cat ${original} | sed 's/\t/ ;/g' > ${new}
This changes the separator on every line:
awk -F'\t' -v OFS=";" '{$1=$1}1' file
This leaves the header untouched:
awk -F'\t' -v OFS=";" 'NR>1{$1=$1}1' file
This changes only the header line:
awk -F'\t' -v OFS=";" 'NR==1{$1=$1}1' file
Could you paste some example input so we can see why your header was not modified?
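For the conversion described in the question (whitespace-separated input, tab-separated header, " ; " between fields on every other line), one possible sketch is to switch OFS after the first record. The sample data is invented for illustration.

```shell
#!/bin/sh
# Sketch: header joined with tabs, every later record joined with " ; ".
# Input is split on runs of whitespace (awk default FS). Sample data is made up.
sample='id name score
1 alice 90
2 bob 85'

converted=$(printf '%s\n' "$sample" | awk '
NR == 1 { OFS = "\t" }     # header line: tab between fields
NR == 2 { OFS = " ; " }    # from the second line on: space-semicolon-space
{ $1 = $1; print }         # assigning a field rebuilds $0 with the current OFS
')

printf '%s\n' "$converted"
```

The original attempt looped over NR inside the main rule, which prints each line once per record read so far; that is why the output grew to millions of lines.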
I have an input file in the following form:
<td> Name1 </td>
<td> <span class="test"><a href="url1">Link </a></span></td>
<td> Name2 </td>
<td> <span class="test"><a href="url2">Link </a></span></td>
I want an awk script to read this file and output it in the following manner:
url1 Name1
url2 Name2
Can anyone help me out in this trivial looking problem? Thanks.
Extracting one href per line is relatively simple, so long as the input conforms to XHTML standards, there is at most one href on a line, and you don't care about enclosing tags, but perl is easier:
$ perl -ne 'print "$1\n" if /href="([^"]+)"/'
If you care about enclosing tags or they are not standard conformant, you cannot use regular expressions to parse HTML. It is impossible.
added: oops, you do care about context, forget about regexps and use a real HTML parser
Here is an awk script that does the job
awk '
/a href=\".*\"/ { sub( /^.*a href=\"/,"" ); sub(/\".*/,""); print $0, name }
{ name = $2 }
'
this might work:
awk 'BEGIN {i=1}
{line[i++]=$0}
END {
j=1;
while (j<i)
{print line[j+1] line[j]; j+=2}
}' yourfile|awk '{print substr($4,7,length($4)-6),$6}'
gawk '/^<td>/ {n = $2; getline; print gensub(/.*href="([^"]*).*/,"\\1",1), n}' infile
url1 Name1
url2 Name2
awk 'BEGIN{RS="></td>\n"; FS="> | </|\""}{print $7, $2}' infile
This treats every 2 lines as one record.