I have an input file in the following format:
<td> Name1 </td>
<td> <span class="test">Link </span></td>
<td> Name2 </td>
<td> <span class="test">Link </span></td>
I want an awk script to read this file and produce output in the following format:
url1 Name1
url2 Name2
Can anyone help me out with this trivial-looking problem? Thanks.
Extracting one href per line is relatively simple, as long as the files conform to XHTML standards, there is at most one href on a line, and you don't care about enclosing tags. Perl is easier, though:
$ perl -ne 'print "$1\n" if /href="([^"]+)"/'
If you care about enclosing tags or they are not standard conformant, you cannot use regular expressions to parse HTML. It is impossible.
Added: oops, you do care about context, so forget about regexps and use a real HTML parser.
Here is an awk script that does the job
awk '
/a href=\".*\"/ { sub( /^.*a href=\"/,"" ); sub(/\".*/,""); print $0, name }
{ name = $2 }
'
This might work:
awk 'BEGIN {i=1}
{line[i++]=$0}
END {
j=1;
while (j<i)
{print line[j+1] line[j]; j+=2}
}' yourfile|awk '{print substr($4,7,length($4)-6),$6}'
gawk '/^<td>/ {n = $2; getline; print gensub(/.*href="([^"]*).*/,"\\1",1), n}' infile
url1 Name1
url2 Name2
awk 'BEGIN{RS="></td>\n"; FS="> | </|\""}{print $7, $2}' infile
This treats every two lines as one record.
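For what it's worth, here is a minimal portable sketch of the "remember the name, print it when the link line arrives" idea. The sample input is an assumption: it supposes each link line carries an `<a href="...">` element, which is what the answers above match against. The temporary file and the `url`/`name` variables are only for illustration.

```shell
# Assumed sample input: the link lines contain <a href="..."> elements.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<td> Name1 </td>
<td> <a href="url1"><span class="test">Link </span></a></td>
<td> Name2 </td>
<td> <a href="url2"><span class="test">Link </span></a></td>
EOF

out=$(awk '
# A line with href="...": print the URL followed by the remembered name.
match($0, /href="[^"]*"/) {
    url = substr($0, RSTART + 6, RLENGTH - 7)   # drop the leading href=" and the trailing "
    print url, name
    next
}
# Any other <td> line: remember its second field as the current name.
/<td>/ { name = $2 }
' "$tmp")

printf '%s\n' "$out"
rm -f "$tmp"
```

This only uses match(), RSTART and RLENGTH, so it should not need GNU awk. It still assumes one href per line and a single-word name.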
Hello, I am trying to find a pattern match in some HTML files using awk, but I don't seem to have any luck with it.
For my pattern to match, the file should contain the following:
<tr>
<td>Failures</td>
<td>0</td>
</tr>
<tr>
<td>Warnings</td>
<td>4</td>
</tr>
<tr>
<td>Errors</td>
<td>0</td>
</tr>
<tr>
<td>Not Applicable</td>
<td>53</td>
</tr>
<tr>
<td>Manual Checks</td>
<td>9</td>
</tr>
Failures and Manual Checks should both be zero. In the above file Failures is 0 and Manual Checks is 9, so it should not match. I need a match only when Failures is 0 and Manual Checks is 0.
I tried with and without escaping the newline, but awk is not returning any results.
find . -name "*.html" -exec awk '/td\>Failures\<\/td\>\\n.*\<td\>0/ {print FILENAME}' '{}' \;
I have also tried other combinations like the one below, but I can't figure out why awk is not going to the next line.
find . -name "*.html" -exec awk '/td\>Failures\<\/td\>\\n\[\^\\\<\]\+\<td\>0/ {print FILENAME}' '{}' \;
Can anyone please have a look and tell me what I am missing?
If your html files are well-formed xml, then xmlstarlet will work:
find . -name '*.html' \
-exec xmlstarlet sel -t \
--if '//tr[td[1] = "Failures" and td[2] = "0"]' \
--if '//tr[td[1] = "Manual Checks" and td[2] = "0"]' \
--inp-name --nl \
'{}' \;
if there's a row where the first cell is Failures and the second cell is 0,
and if there's a row where the first cell is Manual Checks and the second cell is 0,
then print the input filename and a newline.
A more reliable solution is going to be based on a tool designed to parse html; having said that ...
One awk idea using a couple of custom regex patterns (the RS="^$" idiom is a GNU awk feature):
$ cat regex.awk
BEGIN { RS="^$" # whole file treated as a single line of input
regex1="<td>Manual Checks</td>[[:space:]]+<td>0</td>"
regex2="<td>Failures</td>[[:space:]]+<td>0</td>"
}
$0 ~ regex1 && $0 ~ regex2 {print FILENAME}
NOTE: placing the code in a file (regex.awk) will make the follow-on find/awk quite a bit cleaner
Sample input:
$ cat f1.html
... snip ...
<td>Failures</td>
<td>0</td> # match
... snip ...
<td>Manual Checks</td>
<td>9</td> # not a match
... snip ...
$ cat f2.html
... snip ...
<td>Failures</td>
<td>0</td> # match
... snip ...
<td>Manual Checks</td>
<td>0</td> # match
... snip ...
NOTE: comments added for clarification; the comments do not exist in the actual files
Adding this to a find call:
$ find . -name "f?.html" -exec awk -f regex.awk '{}' \;
./f2.html
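If GNU awk is not available, a rough equivalent is to collect the file into one string and test it in END. This is only a sketch: the function name `has_zero_checks` and the sample files are made up for illustration, and `[ \t\r\n]+` stands in for `[[:space:]]+` so that older awks without POSIX character classes can run it.

```shell
# has_zero_checks FILE: exit status 0 only if both rows carry the value 0.
has_zero_checks() {
    awk '
        { buf = buf $0 "\n" }                 # collect the whole file in one string
        END {
            ok = (buf ~ /<td>Failures<\/td>[ \t\r\n]+<td>0<\/td>/) &&
                 (buf ~ /<td>Manual Checks<\/td>[ \t\r\n]+<td>0<\/td>/)
            exit !ok                          # status 0 means "both matched"
        }
    ' "$1"
}

# Hypothetical sample files, mirroring f1.html and f2.html above.
tmp=$(mktemp -d)
printf '%s\n' '<tr>' '<td>Failures</td>' '<td>0</td>' '</tr>' \
              '<tr>' '<td>Manual Checks</td>' '<td>9</td>' '</tr>' > "$tmp/f1.html"
printf '%s\n' '<tr>' '<td>Failures</td>' '<td>0</td>' '</tr>' \
              '<tr>' '<td>Manual Checks</td>' '<td>0</td>' '</tr>' > "$tmp/f2.html"

out=$(cd "$tmp" && for f in f1.html f2.html; do
          has_zero_checks "$f" && printf '%s\n' "$f"
      done)
printf '%s\n' "$out"
rm -rf "$tmp"
```

Because the result is reported through the exit status, the same awk program can be dropped into find as an -exec test followed by -print, since find treats -exec as true when the command exits with status 0.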
Using any awk in any shell on every Unix box:
$ cat tst.awk
gsub("^[[:space:]]*<td>|</td>[[:space:]]*$","") {
if ( ++cnt % 2 ) {
tag = $0
}
else {
f[tag] = $0+0
}
}
END {
if ( (f["Failures"] == 0) && (f["Manual Checks"] == 0) ) {
print FILENAME
}
}
$ awk -f tst.awk file
The above creates an array f[] that maps the tags (names) of the cells to their values so then in the END section you can do whatever test you like on whatever combination of them you like.
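One caveat with testing `f[...] == 0` directly: in awk an array element that was never set also compares equal to 0, so a file with no "Manual Checks" row at all would be reported too. A hedged variant that checks for presence first is sketched below; the array name `value` and the two sample files are assumptions made for the example.

```shell
tmp=$(mktemp -d)
# ok.html has both rows at 0; missing.html has no "Manual Checks" row at all.
printf '%s\n' '<tr>' '<td>Failures</td>' '<td>0</td>' '</tr>' \
              '<tr>' '<td>Manual Checks</td>' '<td>0</td>' '</tr>' > "$tmp/ok.html"
printf '%s\n' '<tr>' '<td>Failures</td>' '<td>0</td>' '</tr>' \
              '<tr>' '<td>Warnings</td>' '<td>4</td>' '</tr>' > "$tmp/missing.html"

out=$(cd "$tmp" && for f in ok.html missing.html; do
    awk '
        # gsub() returns the number of replacements, so only <td> lines get past this pattern.
        gsub(/^[ \t]*<td>|<\/td>[ \t]*$/, "") {
            if (++cnt % 2) { label = $0 }
            else           { value[label] = $0 + 0 }
        }
        END {
            # "in" tests presence without creating the element.
            if (("Failures" in value) && ("Manual Checks" in value) &&
                value["Failures"] == 0 && value["Manual Checks"] == 0)
                print FILENAME
        }
    ' "$f"
done)
printf '%s\n' "$out"
rm -rf "$tmp"
```

Only ok.html is reported here; without the two `in` tests, missing.html would be reported as well.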
I have file1 with records that I want to find and replace with # in file2 and redirect the output to file3. I want to translate only the alphanumeric characters in file2. With the below code I'm not able to get the expected output. What am I doing wrong?
file_read=`cat file2`
while read line; do
var=`echo $line | tr '[a-zA-Z0-9]' '#'`
rep=`echo $file_read | awk "{gsub(/$line/,\"$var\"); print}"`
done < file1
echo file2 > file3
cat file1
2001009
#vanti Finserv Co.
2001009
Fund #1
11:11 - Capital
MS&CO(NY)
American Friends Org, Inc. 12X32
Domain-Name (LLC)
MS&CO(NY)
MS&CO(NY)
Ivy/Estate Rd
E*Trade wholesale
cat file2
<html>
<body>
<hr><br><>span class="table">Records</span><table>
<tr class="column">
<td>Rec1</td>
<td>Rec2</td>
<td>Rec3</td>
<td>Rec4</td>
<td>Rec5</td>
<td>Rec6</td>
<td>Rec7</td>
<td>Rec8</td>
</tr>
<tr class="data">
<td>#vanti Finserv Co.</td>
<td>11:11 - Capital</td>
<td>MS&CO(NY)</td>
<td>New York</td>
<td>CDX98XSD</td>
<td>E*Trade wholesale</td>
<td>Domain-Name (LLC)</td>
<td>Ivy/Estate Rd</td>
<td></td>
</tr>
<tr class="data">
<td>#vanti Finserv Co.</td>
<td></td>
<td>MS&CO(NY)</td>
<td>2</td>
<td>2</td>
<td>MS&CO(NY)</td>
<td>MS&CO(NY)</td>
<td>Ivy/Estate Rd</td>
</table>
</body>
</html>
expected output
cat file3
<html>
<body>
<hr><br><>span class="table">Records</span><table>
<tr class="column">
<td>Rec1</td>
<td>Rec2</td>
<td>Rec3</td>
<td>Rec4</td>
<td>Rec5</td>
<td>Rec6</td>
<td>Rec7</td>
<td>Rec8</td>
</tr>
<tr class="data">
<td>###### ####### ##.</td>
<td>##:## - #######</td>
<td>##&##(##)</td>
<td>New York</td>
<td>CDX98XSD</td>
<td>#*##### ########</td>
<td>######-#### (###)</td>
<td>###/###### ##</td>
<td></td>
</tr>
<tr class="data">
<td>###### ####### ##.</td>
<td></td>
<td>##&##(##)</td>
<td>2</td>
<td>2</td>
<td>##&##(##)</td>
<td>##&##(##)</td>
<td>###/###### ##</td>
</table>
</body>
</html>
Would you please try the following:
awk '
NR==FNR {s = $0; gsub("[[:alnum:]]", "#"); a[s] = $0; next}
{
if (match($0, ">[^<]+")) {
str = substr($0, RSTART+1, RLENGTH-1)
if (str in a) {
$0 = substr($0, 1, RSTART) a[str] substr($0, RSTART+RLENGTH)
}
}
}
1 ' file1 file2 > file3
It assumes the strings to be replaced are enclosed in tags, which holds for the shown example.
You seem to be looking for something like
awk 'NR==FNR {
regex = $0;
gsub(/[][(){}|\\*+?.^$]/, "\\\\&", regex);
a[++n] = regex;
gsub(/[A-Za-z0-9]/, "#");
gsub(/&/, "\\\\&");
b[n] = $0;
next
}
{ for(i=1;i<=n;++i)
gsub(a[i], b[i])
} 1' file1 file2 >file3
In brief, we populate the array a with the phrases from file1 (escaped so they can be used as regexes), and b with the corresponding replacement strings. The condition NR==FNR is true only while reading the first input file; for file2 we fall through to the rest of the script, which simply replaces any match of a[i] with the corresponding b[i], and prints all the lines.
The code is complicated somewhat by the escaping of regex metacharacters in a and further by the fact that & in the replacement string needs to be escaped, too (& alone recalls the matched text).
Demo: https://ideone.com/YkAkAZ
You generally want to avoid while read loops in the shell; Awk is much faster and more idiomatic when you want to perform some transformation on all lines in a file.
As a further aside, please try http://shellcheck.net/ before asking for human assistance. Even after you fixed syntax errors pointed out in comments, your attempt contains common beginner errors such as broken quoting.
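If the regex and `&` escaping feels fragile, one alternative (not what the answers above do) is to avoid regexes for the search entirely and do a literal replacement with index() and substr(). The sketch below is written under that assumption; the helper names `mask` and `replace_all` and the shortened sample files are invented for illustration.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'MS&CO(NY)' 'E*Trade wholesale' > "$tmp/file1"
printf '%s\n' '<td>MS&CO(NY)</td>' '<td>New York</td>' \
              '<td>E*Trade wholesale</td>' > "$tmp/file2"

out=$(awk '
# mask(s): copy of s with every ASCII letter and digit replaced by "#"
function mask(s) { gsub(/[A-Za-z0-9]/, "#", s); return s }

# replace_all(s, from, to): literal (non-regex) replacement of from by to
function replace_all(s, from, to,    out, p) {
    out = ""
    while ((p = index(s, from)) > 0) {
        out = out substr(s, 1, p - 1) to
        s = substr(s, p + length(from))
    }
    return out s
}

# First file: remember each non-empty phrase and its masked form.
NR == FNR {
    if ($0 != "") { n++; from[n] = $0; to[n] = mask($0) }
    next
}

# Second file: apply every literal replacement, then print the line.
{
    for (i = 1; i <= n; i++) $0 = replace_all($0, from[i], to[i])
    print
}
' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$out"
rm -rf "$tmp"
```

Since nothing is interpreted as a regex or as a sub() replacement string, characters such as `&`, `*` and `(` need no escaping. Like the gsub version, it replaces a phrase wherever it occurs in a line, so a short phrase such as a bare number would also be masked inside longer text.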
Here is my file that has each line as mentioned below:
<a href="abc.com" aria-label="abc ofe" class="abc"><span class="bcd">abc</span><span class="icon"></span></a>
here is expected output:
href="abc.com"
aria-label="abc ofe"
class="abc"
class="bcd"
class="icon"
Here is what I got:
awk '{for(i=1; i<=NF; ++i)printf "%s%s", $i, (i<NFi?FS:(i<NF?"\n":RS))}'
echo "<a href="abc.com" aria-label="abc ofe" class="abc"><span class="bcd">abc</span><span class="icon"></span></a>" | awk '{for(i=1; i<=NF; ++i)printf "%s%s", $i, (i<NFi?FS:(i<NF?"\n":RS))}'
gives me:
<a
href=abc.com
aria-label=abc
ofe
class=abc><span
class=bcd>abc</span><span
class=icon></span></a>
Trying to get "attribute" string before double quote
and "attribute value" string between double quote.
Need to use awk or sed for macOS.
With your shown samples, you could try the following awk code. The simple explanation: set RS (the record separator) to a regex matching each attribute the OP wants in the output, then print RT, the text that matched RS. Note that RT is specific to GNU awk.
awk -v RS='href="[^"]*"|aria-label="[^"]*"|class="[^"]*"' 'RT{print RT}' Input_file
With GNU awk and a regex:
awk '{$1=$1}1' FPAT='[^= ]+="[^"]+"' OFS="\n" file
Output:
href="abc.com"
aria-label="abc ofe"
class="abc"
class="bcd"
class="icon"
FPAT: A regular expression describing the contents of the fields in a record. When set, gawk parses the input into fields, where the fields match the regular expression, instead of using the value of FS as the field separator.
See: man awk and 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
Try this Perl solution:
$ cat sharad.txt
<a href="abc.com" aria-label="abc ofe" class="abc"><span class="bcd">abc</span><span class="icon"></span></a>
$ perl -0777 -ne ' while(/\<([^>]+)>/sg){ $a=$1; while($a=~/(\S+?\".+?\")/mg) { print "$1\n" } } ' sharad.txt
href="abc.com"
aria-label="abc ofe"
class="abc"
class="bcd"
class="icon"
$
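Since the question asks for something that runs on macOS, where the stock awk has neither RT nor FPAT, here is a hedged sketch that uses only match(), RSTART and RLENGTH. It assumes the input line is the full anchor element implied by the expected output, and that every attribute value is double-quoted.

```shell
# Assumed input: the complete <a ...> element with double-quoted attribute values.
line='<a href="abc.com" aria-label="abc ofe" class="abc"><span class="bcd">abc</span><span class="icon"></span></a>'

out=$(printf '%s\n' "$line" | awk '
{
    s = $0
    # Repeatedly find the next name="value" pair, print it, and continue after it.
    while (match(s, /[^ =<>"]+="[^"]*"/)) {
        print substr(s, RSTART, RLENGTH)
        s = substr(s, RSTART + RLENGTH)
    }
}')
printf '%s\n' "$out"
```

The loop consumes the line from left to right, so the attributes come out in document order. Unquoted or single-quoted attribute values are not handled.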
I have a log file like this:
some text line
other text line
<a>
<b>1</b>
<c>2</c>
</a>
another text line
<a>
<b>1</b>
<c>2</c>
</a>
yet another text line
I need to get only the first occurrence of the XML element "a":
<a>
<b>1</b>
<c>2</c>
</a>
I know
awk '/<a>/,/<\/a>/' file.log
will find all occurrences. How can I get just the first? (Adding |head -n1 obviously doesn't work because it captures only the first line, and I can't know in advance how long "a" is: the awk expression must be generic because I have different log files with different "a" contents.)
Another slight variation is to use a flag variable that records when you are inside the first <a>...</a> block, print that block, and exit afterwards. Here n is that flag, e.g.
awk -v n=0 '$1=="</a>" {print $1; exit} $1=="<a>" {n=1}; n==1' f.xml
Example Use/Output
With your input file as f.xml you would get:
$ awk -v n=0 '$1=="</a>" {print $1; exit} $1=="<a>" {n=1}; n==1' f.xml
<a>
<b>1</b>
<c>2</c>
</a>
(note: the n==1 rule has no action, so it relies on the default action (print) to output the record)
This awk:
awk '
match($0,/<a>/) {
$0=substr($0,RSTART)
flag=1
}
match($0,/<\/a/) {
$0=substr($0,1,RSTART+RLENGTH)
print
exit
}
flag' file
can handle these forms, for example this:
<a><b>1</b><c>2</c></a>
and this:
<a>
<b>1</b>
<c>2</c>
</a>
and also <a>
<b>1</b>
<c>2</c>
</a> this
the end
Another for GNU awk:
$ gawk -v RS="</?a>" '
NR==1 { printf RT }
NR==2 { print $0 RT }
' file
First:
$ awk '/<a>/{f=1} f; /<\/a>/{exit}' file
<a>
<b>1</b>
<c>2</c>
</a>
Last:
$ tac file | awk '/<\/a>/{f=1} f; /<a>/{exit}' | tac
<a>
<b>1</b>
<c>2</c>
</a>
Nth:
$ awk -v n=2 '/<a>/{c++} c==n{print; if (/<\/a>/) exit}' file
<a>
<b>1</b>
<c>2</c>
</a>
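Because the log files differ, it may be convenient to wrap the "First" variant in a small shell function that takes the tag name as a parameter. This is just a sketch: `first_block`, `otag` and `ctag` are names made up here, and the tags are matched literally with index() rather than as regexes.

```shell
# first_block TAG FILE: print the first <TAG>...</TAG> block, then stop reading.
first_block() {
    awk -v otag="<$1>" -v ctag="</$1>" '
        index($0, otag)           { inside = 1 }   # opening tag seen
        inside                    { print }
        inside && index($0, ctag) { exit }         # closing tag seen: stop
    ' "$2"
}

# Hypothetical sample log with two blocks.
tmp=$(mktemp)
printf '%s\n' 'some text line' '<a>' '<b>1</b>' '<c>2</c>' '</a>' \
              'another text line' '<a>' '<b>3</b>' '</a>' > "$tmp"

out=$(first_block a "$tmp")
printf '%s\n' "$out"
rm -f "$tmp"
```

Like the one-liners above, it prints whole lines, so text sharing a line with the tags is included. An opening tag with attributes would not be recognized by the literal `<TAG>` match.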
Here is a (real-world) text:
<tr>
randomtext
ip_(45.54.58.85)
randomtext..
port(randomtext45)
randomtext random...
</tr>
<tr>
randomtext ran
ip_(5.55.45.8)
randomtext4
port(other$_text_other_length444)
</tr>
<tr>
randomtext
random
port(other$text52)
</tr>
output should be:
45.54.58.85 45
5.55.45.8 444
I know how to grep 45.54.58.85 and 5.55.45.8
awk 'BEGIN{ RS="<tr>"}1' file | grep -oP '(?<=ip_\()[^)]*'
How can I grep the port, taking into account that there is random text of random length after port( ?
I added a third record that should not appear in the output, as it has no ip.
Using GNU Awk:
gawk 'BEGIN { RS = "<tr>" } match($0, /.*^ip_[(]([^)]+).*^port[(].*[^0-9]+([0-9]+)[)].*/, a) { print a[1], a[2] }' your_file
And another that's compatible with any Awk:
awk -F '[()]' '$1 == "<tr>" { i = 0 } $1 == "ip_" { i = $2 } $1 == "port" && i { sub(/.*[^0-9]/, "", $2); if (length($2)) print i, $2 }' your_file
Output:
45.54.58.85 45
5.55.45.8 444
With GNU awk, grep and paste:
$ awk 'BEGIN{ RS="<tr>"}/ip_/{print;}' file | grep -oP 'ip_\(\K[^)]*|port\(\D*\K\d+' | paste - -
45.54.58.85 45
5.55.45.8 444
Explanation:
awk 'BEGIN{ RS="<tr>"}/ip_/{print;}' file: with the record separator set to <tr>, this awk command prints only the records that contain the string ip_.
ip_\(\K[^)]* prints only the text just after ip_( up to the next ) symbol. \K in the pattern discards the previously matched characters.
| is the logical OR (alternation) symbol.
port\(\D*\K\d+ prints only the first run of digits inside the port() string.
paste - - combines every two lines into one.
Here is another awk
awk -F"[()]" '/^ip/ {ip=$2;f=NR} f && NR==f+2 {n=split($2,a,"[a-z]+");print ip,a[n]}' file
45.54.58.85 45
5.55.45.8 444
How it works:
awk -F"[()]" ' # Set field separator to "(" or ")"
/^ip/ { # If line starts with "ip" do
ip=$2 # Set "ip" to field $2
f=NR} # Set "f" to line number
f && NR==f+2 { # Go two lines down (assumes port is always two lines after ip) and
n=split($2,a,"[a-z]+") # Split second part to get port
print ip,a[n] # Print "ip" and "port"
}' file # Read the file
With any modern awk:
$ awk -F'[()]' '
$1=="ip_" { ip=$2 }
$1=="port" { sub(/.*[^[:digit:]]/,"",$2); port=$2 }
$1=="</tr>" { if (ip) print ip, port; ip="" }
' file
45.54.58.85 45
5.55.45.8 444
Couldn't be much simpler and clearer IMHO.