Bash, awk, get specific string from file - awk

Issue
I am extracting data from a file with awk, specifically the string between the double quotes in <a href="DATA">.
Source file.
...
<!-- Page 18 -->
<p style="position:absolute;top:956px;left:485px;white-space:nowrap" class="ft1829">145041</p>
<p style="position:absolute;top:586px;left:246px;white-space:nowrap" class="ft1829">145042</p>
<p style="position:absolute;top:156px;left:446px;white-space:nowrap" class="ft1829">440332</p>
<!-- Page 19 -->
<p style="position:absolute;top:1205px;left:53px;white-space:nowrap" class="ft1938"><b>1 790,- </b>| 457710</p>
<p style="position:absolute;top:1205px;left:634px;white-space:nowrap" class="ft1938"><b>2 290,- </b>| 464429</p>
<p style="position:absolute;top:924px;left:353px;white-space:nowrap" class="ft1938"><b>2 590,- </b>| 464430</p>
...
Command (with help on this forum).
awk '/Page/ {h=$3} /-- Page 1 --/ {h="Title"} /href=/ && h {split($0,a,"\"");print h","a[6]}'
Results.
...
18,145041
18,145042
18,440332
19,457710
19,464429
...
The problem is that when several links are on the same line, only the first link is processed.
Example.
` 457710</p> |  464429</p>`
Output.
...
18,457710,
...
Expected output.
...
18,457710,
18,464429,
...
What is wrong in awk command?
Thanks for any ideas.
Update 1
I need to take all hrefs from this input.
<!-- Page 1 -->
<p style="position:absolute;top:397px;left:23px;white-space:nowrap" class="ft116">237002 | 237003</p>
<p style="position:absolute;top:831px;left:666px;white-space:nowrap" class="ft124">230041</p>
<p style="position:absolute;top:855px;left:447px;white-space:nowrap" class="ft116">467173</p>
<p style="position:absolute;top:910px;left:36px;white-space:nowrap" class="ft116">Hmotnost: 6 kg | 464431</p>
<!-- Page 2 -->
<p style="position:absolute;top:1176px;left:561px;white-space:nowrap" class="ft216">318417</p>
<p style="position:absolute;top:963px;left:561px;white-space:nowrap" class="ft216">338701</p>
...
Command.
awk 'match($0,/class=\"[a-zA-Z]+[0-9]+/){val=substr($0,RSTART,RLENGTH);sub(/[^0-9]*/,"",val)} match($0,/<a href=\"[0-9]+/){val1=substr($0,RSTART,RLENGTH);sub(/[^"]*\"/,"",val1);print substr(val,1,2)","val1}' test.html
Output.
11,237002
12,230041
11,467173
11,464431
21,318417
...
But I need this (for example, 1,237003 is missing from the result above, and the page number in the first column is wrong).
1,237002
1,237003
1,230041
1,467173
1,464431
2,318417
...
Thanks.

As the awk command only processes the first hyperlink on each line, edit the stream first so that every link starts on its own line (the \n in the replacement is a GNU sed feature):
sed 's/\(a href=\)/\n\1/g' data-file | awk '/page/ ....'
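An alternative that avoids pre-editing the file is to let awk itself walk over every link on a line. The following is a minimal sketch, not the original poster's command: the file name sample.html, the helper name list_links, and the exact shape of the <a href="..."> tags are assumptions, since the links in the question's input are not shown in full.

```shell
# Hypothetical sample input; the real file's link markup may differ.
cat > sample.html <<'EOF'
<!-- Page 18 -->
<p class="ft1829"><a href="145041">145041</a></p>
<!-- Page 19 -->
<p class="ft1938"><a href="457710">457710</a> | <a href="464429">464429</a></p>
EOF

# list_links is a hypothetical helper: prints "page,href" for every link.
list_links() {
    awk '
    /<!-- Page/ { page = $3 }          # remember the current page number
    {
        rest = $0
        # consume the line one href="..." at a time
        while (match(rest, /href="[^"]*"/)) {
            # skip the 6 characters of href=" and drop the closing quote
            print page "," substr(rest, RSTART + 6, RLENGTH - 7)
            rest = substr(rest, RSTART + RLENGTH)
        }
    }' "$@"
}

list_links sample.html
```

Because match() only ever reports the leftmost match, the loop cuts the already processed part off the front of the line and tries again until no href is left.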

Tested with the given example; could you please try the following.
awk '
{
  gsub("</p> | ","&\n")
  $1=$1
}
match($0,/class=\"[a-zA-Z]+[0-9]+/){
  val=substr($0,RSTART,RLENGTH)
  sub(/[^0-9]*/,"",val)
}
match($0,/<a href=\"[0-9]+/){
  val1=substr($0,RSTART,RLENGTH)
  sub(/[^"]*\"/,"",val1)
  print substr(val,1,2)","val1
  val=val1=""
}
' Input_file

Related

Replace only alphanumeric chars from strings in one file in another

I have file1 with records that I want to find and replace with # in file2 and redirect the output to file3. I want to translate only the alphanumeric characters in file2. With the below code I'm not able to get the expected output. What am I doing wrong?
file_read=`cat file2`
while read line; do
var=`echo $line | tr '[a-zA-Z0-9]' '#'`
rep=`echo $file_read | awk "{gsub(/$line/,\"$var\"); print}"`
done < file1
echo file2 > file3
cat file1
2001009
#vanti Finserv Co.
2001009
Fund #1
11:11 - Capital
MS&CO(NY)
American Friends Org, Inc. 12X32
Domain-Name (LLC)
MS&CO(NY)
MS&CO(NY)
Ivy/Estate Rd
E*Trade wholesale
cat file2
<html>
<body>
<hr><br><>span class="table">Records</span><table>
<tr class="column">
<td>Rec1</td>
<td>Rec2</td>
<td>Rec3</td>
<td>Rec4</td>
<td>Rec5</td>
<td>Rec6</td>
<td>Rec7</td>
<td>Rec8</td>
</tr>
<tr class="data">
<td>#vanti Finserv Co.</td>
<td>11:11 - Capital</td>
<td>MS&CO(NY)</td>
<td>New York</td>
<td>CDX98XSD</td>
<td>E*Trade wholesale</td>
<td>Domain-Name (LLC)</td>
<td>Ivy/Estate Rd</td>
<td></td>
</tr>
<tr class="data">
<td>#vanti Finserv Co.</td>
<td></td>
<td>MS&CO(NY)</td>
<td>2</td>
<td>2</td>
<td>MS&CO(NY)</td>
<td>MS&CO(NY)</td>
<td>Ivy/Estate Rd</td>
</table>
</body>
</html>
expected output
cat file3
<html>
<body>
<hr><br><>span class="table">Records</span><table>
<tr class="column">
<td>Rec1</td>
<td>Rec2</td>
<td>Rec3</td>
<td>Rec4</td>
<td>Rec5</td>
<td>Rec6</td>
<td>Rec7</td>
<td>Rec8</td>
</tr>
<tr class="data">
<td>###### ####### ##.</td>
<td>##:## - #######</td>
<td>##&##(##)</td>
<td>New York</td>
<td>CDX98XSD</td>
<td>#*##### ########</td>
<td>######-#### (###)</td>
<td>###/###### ##</td>
<td></td>
</tr>
<tr class="data">
<td>###### ####### ##.</td>
<td></td>
<td>##&##(##)</td>
<td>2</td>
<td>2</td>
<td>##&##(##)</td>
<td>##&##(##)</td>
<td>###/###### ##</td>
</table>
</body>
</html>
Would you please try the following:
awk '
NR==FNR {s = $0; gsub("[[:alnum:]]", "#"); a[s] = $0; next}
{
  if (match($0, ">[^<]+")) {
    str = substr($0, RSTART+1, RLENGTH-1)
    if (str in a) {
      $0 = substr($0, 1, RSTART) a[str] substr($0, RSTART+RLENGTH)
    }
  }
}
1 ' file1 file2 > file3
It assumes the strings to be replaced are enclosed in tags, but it works with the shown example.
You seem to be looking for something like
awk 'NR==FNR {
    regex = $0;
    gsub(/[][(){}|\\*+?.^$]/, "\\\\&", regex);
    a[++n] = regex;
    gsub(/[A-Za-z0-9]/, "#");
    gsub(/&/, "\\\\&");
    b[n] = $0;
    next
  }
  { for(i=1;i<=n;++i)
      gsub(a[i], b[i])
  } 1' file1 file2 >file3
In brief, we populate the array a with the phrases from file1, and b with the corresponding replacement strings. The condition NR==FNR is true only while reading the first input file; for the second file we fall through to the rest of the script, which replaces any strings from a with the corresponding string from b and prints every line.
The code is complicated somewhat by the escaping of regex metacharacters in a and further by the fact that & in the replacement string needs to be escaped, too (& alone recalls the matched text).
Demo: https://ideone.com/YkAkAZ
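The two escaping steps described above can be tried in isolation. This is only a sketch: the helper names escape_regex and mask_replacement are made up for illustration, and the sample phrase is taken from file1.

```shell
# escape_regex (hypothetical helper): backslash-escape regex metacharacters
# so the phrase can later be used as a literal pattern.
escape_regex() {
    awk '{ gsub(/[][(){}|\\*+?.^$]/, "\\\\&"); print }'
}

# mask_replacement (hypothetical helper): turn alphanumerics into # and
# protect & so gsub() does not treat it as "the matched text".
mask_replacement() {
    awk '{ gsub(/[A-Za-z0-9]/, "#"); gsub(/&/, "\\\\&"); print }'
}

printf '%s\n' 'MS&CO(NY)' | escape_regex       # MS&CO\(NY\)
printf '%s\n' 'MS&CO(NY)' | mask_replacement   # ##\&##(##)

# Using the two results as pattern and replacement inside awk
# (each backslash is doubled again because these are awk string literals):
printf '%s\n' '<td>MS&CO(NY)</td>' |
    awk '{ gsub("MS&CO\\(NY\\)", "##\\&##(##)"); print }'   # <td>##&##(##)</td>
```

Without the escaping, ( and ) would act as a regex group in the pattern, and the bare & in the replacement would paste the matched text back in.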
You generally want to avoid while read loops in the shell; Awk is much faster and more idiomatic when you want to perform some transformation on all lines in a file.
As a further aside, please try http://shellcheck.net/ before asking for human assistance. Even after you fixed syntax errors pointed out in comments, your attempt contains common beginner errors such as broken quoting.

Using awk to process html-related Gift-format Moodle questions

This is basically an awk question, but it is about processing data for the Moodle GIFT format, thus the tags.
I want to format html code in a question (Moodle "test" activity) but I need to replace < and > with the corresponding entities, as these will be interpreted as "real" html, and not printed.
However, I want to be able to type the question with regular code and post-process the file before importing it as gift into Moodle.
I thought awk would be the perfect tool to do this.
Say I have this (invalid as such) Moodle/gift question:
::q1::[html]This is a question about HTML:
<pre>
<p>some text</p>
</pre>
and some tag:<code><img></code>
{T}
What I want is a script that translates this into a valid gift question:
::q1::[html]This is a question about HTML:
<pre>
&lt;p&gt;some text&lt;/p&gt;
</pre>
and some tag:<code>&lt;img&gt;</code>
{T}
key point: replace < and > with &lt; and &gt; when:
inside a <pre>-</pre> block (assuming those tags are alone on a line)
between <code> and </code>, with an arbitrary string in between.
For the first part, I'm fine. I have a shell script calling awk (gawk, actually).
awk -f process_src2gift.awk $1.src >$1.gift
with process_src2gift.awk:
BEGIN { print "// THIS IS A GENERATED FILE !" }
{
    if( $1=="<pre>" ) # opening a "code" block
    {
        code=1;
        print $0;
    }
    else
    {
        if( $1=="</pre>" ) # closing a "code" block
        {
            code=0;
            print $0;
        }
        else
        { # if "code block", replace < > by html entities
            if( code==1 )
            {
                gsub(">","\\&gt;");
                gsub("<","\\&lt;");
            }
            print $0;
        }
    }
}
END { print "// END" }
However, I'm stuck with the second requirement..
Questions:
Is it possible to add code to my awk script to process the HTML inside the <code> tags? Any ideas? I thought about using sed, but I didn't see how to do that.
Maybe awk isn't the right tool for that? I'm open to any suggestion of another (standard Linux) tool.
Answering own question.
I found a solution by doing a two step awk process:
first step as described in question
second step by defining <code> or </code> as the field delimiter, using a regex, and doing the string replacement on the second field ($2).
The shell file becomes:
echo "Step 1"
awk -f process_src2gift.awk $1.src >$1.tmp
echo "Step 2"
awk -f process_src2gift_2.awk $1.tmp >$1.gift
rm $1.tmp
And the second awk file (process_src2gift_2.awk) will be:
BEGIN { FS="[<][/]?[c][o][d][e][>]"; }
{
    gsub(">","\\&gt;",$2);
    gsub("<","\\&lt;",$2);
    if( NF >= 3 )
        print $1 "<code>" $2 "</code>" $3
    else
        print $0
}
Of course, there are limitations:
no attributes in the <code> tag
only one pair <code></code> in the line
probably others...
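The "only one pair per line" limitation can be lifted by looping over the line with index() instead of splitting it into fields, which also lets both steps run in a single pass. The following is a sketch under the same assumptions as above (<pre> and </pre> alone on a line, no attributes on <code>); the helper name gift_escape and the file name q1.src are made up for illustration.

```shell
# Hypothetical source file, based on the question's example.
cat > q1.src <<'EOF'
::q1::[html]This is a question about HTML:
<pre>
<p>some text</p>
</pre>
and some tag:<code><img></code>
{T}
EOF

# gift_escape (hypothetical helper): escape < and > inside <pre> blocks
# and inside every <code>...</code> pair on a line.
gift_escape() {
    awk '
    function esc(s) {
        gsub(/</, "\\&lt;", s)
        gsub(/>/, "\\&gt;", s)
        return s
    }
    $1 == "<pre>"  { inpre = 1; print; next }
    $1 == "</pre>" { inpre = 0; print; next }
    inpre          { print esc($0); next }
    {
        out = ""; rest = $0
        # i: start of the next <code>; j: offset of its </code> in what follows
        while ((i = index(rest, "<code>")) &&
               (j = index(substr(rest, i + 6), "</code>"))) {
            out  = out substr(rest, 1, i + 5) esc(substr(rest, i + 6, j - 1)) "</code>"
            rest = substr(rest, i + j + 12)   # continue after this </code>
        }
        print out rest
    }' "$@"
}

gift_escape q1.src
```

An unclosed <code> simply ends the loop, so the remainder of such a line is printed unchanged.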

How to get only first occurrence in log file using awk

I have a log file like this:
some text line
other text line
<a>
<b>1</b>
<c>2</c>
</a>
another text line
<a>
<b>1</b>
<c>2</c>
</a>
yet another text line
I need to get only the first occurrence of the XML "a":
<a>
<b>1</b>
<c>2</c>
</a>
I know
awk '/<a>/,/<\/a>/' file.log
will find all occurrences. How can I get just the first? (Adding | head -n1 obviously doesn't work, because it captures only the first line, and I can't know how long "a" is: the awk expression must be generic, since I have different log files with different "a" contents.)
Another slight variation is to use a flag variable to indicate when you are inside the first <a>...</a> block, print that block, and exit afterwards. Here n is the flag, e.g.
awk -v n=0 '$1=="</a>" {print $1; exit} $1=="<a>" {n=1}; n==1' f.xml
Example Use/Output
With your input file as f.xml you would get:
$ awk -v n=0 '$1=="</a>" {print $1; exit} $1=="<a>" {n=1}; n==1' f.xml
<a>
<b>1</b>
<c>2</c>
</a>
(note: the n==1 pattern has no action of its own, so it relies on the default action (print) to output the record)
This awk:
awk '
match($0,/<a>/) {
    $0=substr($0,RSTART)
    flag=1
}
match($0,/<\/a/) {
    $0=substr($0,1,RSTART+RLENGTH)
    print
    exit
}
flag' file
can handle all of the following forms. This one:
<a><b>1</b><c>2</c></a>
and this:
<a>
<b>1</b>
<c>2</c>
</a>
and also <a>
<b>1</b>
<c>2</c>
</a> this
the end
Another for GNU awk:
$ gawk -v RS="</?a>" '
NR==1 { printf RT }
NR==2 { print $0 RT }
' file
First:
$ awk '/<a>/{f=1} f; /<\/a>/{exit}' file
<a>
<b>1</b>
<c>2</c>
</a>
Last:
$ tac file | awk '/<\/a>/{f=1} f; /<a>/{exit}' | tac
<a>
<b>1</b>
<c>2</c>
</a>
Nth:
$ awk -v n=2 '/<a>/{c++} c==n{print; if (/<\/a>/) exit}' file
<a>
<b>1</b>
<c>2</c>
</a>

Awk insert text from a file before 2 specific lines

I have 2 files, file1 and file2.
# cat /tmp/file1
***** insert new text ****
# cat /tmp/file2
</table>
some text
</table>
<table name="test" description="test line">
some text
I want to insert the text from file1 into file2 but only before the following 2 lines:
</table>
<table name="test" description="test line">
So the end result is:
</table>
some text
***** insert new text ****
</table>
<table name="test" description="test line">
some text
Here is the awk statement/commands I am trying, but the problem is awk is inserting the new text for each match.
# f1="$(</tmp/file1)"
# awk -vf1="$f1" '/<\/table>/,/<table name="test" description="test line">/{print f1;print;next}1' /tmp/file2
***** insert new text ****
</table>
***** insert new text ****
some text
***** insert new text ****
</table>
***** insert new text ****
<table name="test" description="test line">
some text
How do I fix the awk statement to only insert the text from file1 before those specific 2 lines? Thanks in advance.
That works for me:
awk '{if(p=="</table>"&&$0=="<table name=\"test\" description=\"test line\">")
{system("cat file1");}if(NR>1){print p}; p=$0}END{print p}' file2
The if statement checks whether the current line is <table...> and the previous line is </table>. If so, the contents of file1 are printed first. In every case the previous line of file2 is then printed, so the output lags one line behind the input and the END block prints the final line.
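The same one-line lookbehind can be written without system() by reading file1 into a variable first. This is a sketch, not the answer's exact method: the helper name insert_before_pair is made up, the files are recreated in the current directory for the demo, and it assumes file1 is not empty (otherwise the NR==FNR test would also be true for file2).

```shell
# Recreate the question's two files in the current directory.
cat > file1 <<'EOF'
***** insert new text ****
EOF
cat > file2 <<'EOF'
</table>
some text
</table>
<table name="test" description="test line">
some text
EOF

# insert_before_pair (hypothetical helper): $1 = text file, $2 = target file.
insert_before_pair() {
    awk -v target='<table name="test" description="test line">' '
    NR == FNR { text = text $0 ORS; next }   # slurp the text to insert
    seen {
        # prev is printed one line late, so the text can go in front of it
        if (prev == "</table>" && $0 == target)
            printf "%s", text
        print prev
    }
    { prev = $0; seen = 1 }
    END { if (seen) print prev }              # flush the last held line
    ' "$1" "$2"
}

insert_before_pair file1 file2
```

Holding each line back until the next one has been read is what makes it possible to insert text before a pair of lines rather than after it.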

Reading file in a pattern using awk

I have an input file in the following format
<td> Name1 </td>
<td> <span class="test">Link </span></td>
<td> Name2 </td>
<td> <span class="test">Link </span></td>
I want an awk script to read this file and produce output in the following format
url1 Name1
url2 Name2
Can anyone help me out in this trivial looking problem? Thanks.
Extracting the href values is relatively simple, so long as they conform to XHTML standards, there is at most one on a line, and you don't care about the enclosing tags, but perl is easier:
$ perl -ne 'print "$1\n" if /href="([^"]+)"/'
If you care about enclosing tags or they are not standard conformant, you cannot use regular expressions to parse HTML. It is impossible.
added: oops, you do care about context, forget about regexps and use a real HTML parser
Here is an awk script that does the job
awk '
/a href=\".*\"/ { sub( /^.*a href=\"/,"" ); sub(/\".*/,""); print $0, name }
{ name = $2 }
'
this might work:
awk 'BEGIN {i=1}
{line[i++]=$0}
END {
    j=1;
    while (j<i)
        {print line[j+1] line[j]; j+=2}
}' yourfile|awk '{print substr($4,7,length($4)-6),$6}'
gawk '/^<td>/ {n = $2; getline; print gensub(/.*href="([^"]*).*/,"\\1",1), n}' infile
url1 Name1
url2 Name2
awk 'BEGIN{RS="></td>\n"; FS="> | </|\""}{print $7, $2}' infile
This treats every 2 lines as one record.
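For comparison, here is a sketch that needs neither gensub() nor a custom record separator: it remembers the name from the first line of each pair and prints it once the href on the following line is found. The input below is an assumption (the question's links are not shown in full), and the file name links.html and helper name pair_links are made up for illustration.

```shell
# Hypothetical input; the <a href="..."> markup is assumed.
cat > links.html <<'EOF'
<td> Name1 </td>
<td> <span class="test"><a href="url1">Link </a></span></td>
<td> Name2 </td>
<td> <span class="test"><a href="url2">Link </a></span></td>
EOF

# pair_links (hypothetical helper): prints "href name" for each pair of lines.
pair_links() {
    awk '
    match($0, /href="[^"]*"/) {
        # skip the 6 characters of href=" and drop the closing quote
        print substr($0, RSTART + 6, RLENGTH - 7), name
        next
    }
    { name = $2 }      # second field of "<td> Name1 </td>"
    ' "$@"
}

pair_links links.html
```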