Find and update numbering in an HTML file with awk

I am trying to update a numbering/numeration in a test.html file:
<td class="no">(8)</td>
<td class="no">(9)</td>
<td class="no">(10)</td>
<td class="no">(11)</td>
<td class="no">(23)</td>
A new line could be added between the existing lines at any time, so I don't want to update the numbering manually every time. Another condition is that the update should start after number 7.
I tried to use gensub to replace the line with the match, but it doesn't work the way I expected. There must be an easier way to extract the numbers! No tutorial or forum post helped me, or I didn't understand them...
What I have so far:
/^<td class="no">\([0-9]+\)<\/td>$/ {
a = gensub(/(.*)([0-9]+)(.*)/, "\\2", "g") # this finds only 1 digit, why?
if (a > 7) print a
}

If you only need to extract the numbers, you only have to remove every character that is not a digit:
/^<td class="no">\([0-9]+\)<\/td>$/ {
gsub("[^0-9]","")
if ((0+$0) > 7) print
}
Update: (0+$0) > 7 replaces my original $0 > 7 because the Cygwin gawk compares $0 and 7 as strings rather than as numbers. I do not know why; I'm not familiar with Cygwin.
This solution prints the following output:
8
9
10
11
23
If the test.html file had contained a line like
<td class="no">(71)</td>
the original code ($0 > 7) would also have printed
71
in Cygwin.
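For illustration, the difference between string and numeric comparison can be reproduced on any awk, not only the Cygwin one. This is only a minimal sketch, and the variable name result is made up for the demo. Two string constants are compared character by character, so "10" sorts before "7", while adding 0 forces a numeric comparison:

```shell
result=$(awk 'BEGIN {
    s = ("10" > "7")      # string comparison: "1" sorts before "7", so this is 0
    n = (0 + "10" > 7)    # 0 + "10" is the number 10, so this is 1
    print s, n
}')
echo "$result"
```

This is why the string comparison drops 10, 11 and 23 but keeps 8, 9 and 71.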

Related

AWK Line length [duplicate]

I am having trouble understanding why my code counts the way it does. I use the following code to calculate the average line length (the sum of the lengths of all lines divided by the number of lines).
awk '{cnt += length($0)} END { print cnt/NR}' text.txt
In my text file I have the following:
hello
hellohello
There is no empty line between the two lines in the actual text file.
For example, why do I get 16 rather than 15 when I run the code below?
awk '{cnt += length($0)} END { print cnt }' text.txt
I understand that the total of 16 is divided by 2 in my original command because NR (the number of lines) is 2. But why does it count an extra character when there are only 15 in the text file? When I edit the text file, I get different results. If I end on an empty line (press Enter after "hellohello"), that is also counted toward the total, and I get 17.
Basically, I need someone to explain what exactly it is counting and why.
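One possible explanation, offered only as an assumption because the file itself is not shown: if text.txt was saved with Windows (CRLF) line endings, awk strips the newline but keeps the carriage return as part of each terminated line. That would give 6 + 10 = 16 without a final Enter and 6 + 11 = 17 with one, which matches the numbers above. A minimal sketch that recreates such a file (the temporary file and variable names are made up for the demo):

```shell
f=$(mktemp)
# Two lines separated by CRLF, with no line ending after the last one.
printf 'hello\r\nhellohello' > "$f"

# The carriage return on the first line is counted as a character.
with_cr=$(awk '{cnt += length($0)} END {print cnt}' "$f")

# Removing a trailing carriage return first gives the expected total.
without_cr=$(awk '{sub(/\r$/, ""); cnt += length($0)} END {print cnt}' "$f")

echo "$with_cr $without_cr"
rm -f "$f"
```

If the real file behaves the same way, stripping the trailing carriage return inside awk, or converting the file to Unix line endings, should bring the total back to 15.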

Parsing and creating new arguments with getline AWK code

I am writing a pretty long AWK program (NOT terminal script) to parse through a network trace file. I have a situation where the next line in the trace file is ALWAYS a certain type of 'receive' (3 possible types) - however, I only want AWK to handle/print on one type. In short, I want to tell AWK if the next line contains a certain receive type, do not include it. It is my understanding that getline is the best way to go about this.
I have tried a couple of different variations of getline and getline VAR from the manual, but I still cannot search through and reference fields in the next line the way I want. Here is the current code:
if ((event=="r") && (hopSource == hopDest)) {
getline x
if ((x $31 =="arp") || (x $35 =="AODV")) {
#printf("Badline %s %s \n", $31, $35)
}
else {
macLinkRec++;
#printf("MAC Link Recieved from HEAD - %d to MEMBER %d \n", messageSource, messageDest)
}
}
I am using the commented-out "Badline" printf just as a marker to see what is going on. I fully understand how to restructure the code once I get the search and reference correct. I am also able to print the correct 'next' lines. However, I would expect to be able to search through the next line and create new arguments based on what it contains. How do I search a 'next line' based on an argument in AWK? How do I reference fields in that line to create new arguments?
Final note: the number of fields (NF) in the 'next line' varies, but I feel that the $35 field reference should handle any problems there.
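A note on the code above: getline x stores the next line in x as one string and leaves $0, NF and the fields of the current line untouched, so $31 still refers to the current line and x $31 is just the string x concatenated with that field. One way to look at the fields of the next line is to split x into an array. The following is a minimal sketch under simplifying assumptions: a made-up three-field trace where the receive type is in field 3 instead of fields 31 and 35, and without the hopSource/hopDest test.

```shell
trace=$(mktemp)
# Hypothetical 3-field trace; the real file has its types in fields 31 and 35.
printf '%s\n' 'r n1 data' 'r n1 arp' 'r n2 data' 'r n2 tcp' > "$trace"

count=$(awk '
$1 == "r" {
    if ((getline x) > 0) {      # x holds the next line; $0 and NF are unchanged
        split(x, f)             # f[1], f[2], ... are the fields of that next line
        if (f[3] == "arp" || f[3] == "AODV")
            next                # unwanted receive type: do not count it
        macLinkRec++
    }
}
END { print macLinkRec + 0 }' "$trace")

echo "$count"
rm -f "$trace"
```

Alternatively, a plain getline without a variable replaces $0 and re-splits the fields, so $31 and $35 would then refer to the next line, at the cost of losing the current record's fields.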

Extract data between two tags

After searching and reading extensively, I managed to get half of the work done.
Here is the string:
<td class='bold vmiddle'> Owner CIDR: </td><td><span class='jtruncate-text'>42.224.0.0/12</span></td>
I need to extract the 42.224.0.0 and /12 to make a 42.224.0.0/12.
Now I managed to get 42.224.0.0 by using:
sed -n 's/^.*<a.href="[^"]*">\([^<]*\).*/\1/p'
but I'm at a loss how to extract /12.
Can anyone help?
You were pretty close:
sed -n 's/^.*<a.href="[^"]*">\([^<]*\)<\/a>\([^<]*\).*/\1\2/p' file
All that was needed was a 2nd capture group: <\/a> after the 1st one matches the closing tag for <a>, and the 2nd capture group, \([^<]*\), then captures everything up to but not including the closing </span> tag.
\1\2 in the replacement string simply concatenates what the two capture groups matched, yielding 42.224.0.0/12 with the sample input.
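To try the command, here is a minimal sketch. The sample string in the question shows no anchor tag, while the sed expressions match one, so the input below assumes the address is wrapped in a link; the href value /example is purely hypothetical.

```shell
# Assumed input: the address wrapped in an <a> tag with a made-up href.
line="<td class='bold vmiddle'> Owner CIDR: </td><td><span class='jtruncate-text'><a href=\"/example\">42.224.0.0</a>/12</span></td>"

# Group 1 is the link text, group 2 is the text between </a> and the next tag.
cidr=$(printf '%s\n' "$line" | sed -n 's/^.*<a.href="[^"]*">\([^<]*\)<\/a>\([^<]*\).*/\1\2/p')
echo "$cidr"
```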
You can try the awk solution below:
vipin#kali:~$ awk -F'>|<' '{print $(NF-6),$(NF-4)}' OFS="" kk.txt
42.224.0.0/12
This uses two field separators (> and <) at once.
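As a rough check of the field positions, the sketch below splits an assumed input line on both characters. The anchor tag and its href value /example are hypothetical, since the string in the question does not show them; with a different tag layout the NF offsets would change.

```shell
f=$(mktemp)
# Assumed input line with a made-up href.
printf '%s\n' "<td class='bold vmiddle'> Owner CIDR: </td><td><span class='jtruncate-text'><a href=\"/example\">42.224.0.0</a>/12</span></td>" > "$f"

# Every < or > ends a field, so counting back from the end of the line:
# $(NF-6) is the link text and $(NF-4) is the text after </a>.
cidr=$(awk -F'>|<' '{print $(NF-6),$(NF-4)}' OFS="" "$f")
echo "$cidr"
rm -f "$f"
```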

Awk Sum skipping Special Character Row

I am trying to compute the sum of a particular column in a file, namely column 18.
I am using awk with printf to display it in a proper decimal format.
SUM=`cat ${INF_TARGET_FILE_PATH}/${EXTRACT_NAME}_${CURRENT_DT}.txt|awk -F"" '{s+=$18}END{printf("%24.2f\n", s)}'
The above command skips the rows that have a special character in column 5, for example RÉPARATIONS, so those rows are not included in the sum. How can I make the sum include all rows?
A closing backtick is missing in your example. It should be:
SUM=`cat ${INF_TARGET_FILE_PATH}/${EXTRACT_NAME}_${CURRENT_DT}.txt|awk -F"" '{s+=$18}END{printf("%24.2f\n", s)}'`
But you should not use backticks; use the $(code) form instead.
Using cat to feed data to awk is also unnecessary; pass the file path as an argument after the awk program:
SUM=$(awk -F"" '{s+=$18} END {printf "%24.2f\n",s}' "${INF_TARGET_FILE_PATH}/${EXTRACT_NAME}_${CURRENT_DT}.txt")
This may not resolve your problem, but it is more correct code.
If you post a sample of your input file, it would help us understand the problem.

How to skip records that turn on/off the range pattern?

gawk '/<Lexer>/,/<\/Lexer>/' file
This works, but it prints the first and last records, which I'd like to omit. How can I do that?
It says: "The record that turns on the range pattern and the one that turns it off both match the range pattern. If you don't want to operate on these records, you can write if statements in the rule's action to distinguish them from the records you are interested in." but no example.
I tried something like
gawk '/<Lexer>/,/<\/Lexer>/' {1,FNR-1} file
but it doesn't work.
If you have a better way to do this, without using awk, say so.
You can do it with two separate pattern rules and a flag variable:
gawk '/<Lexer>/{p=1; next} /<\/Lexer>/ {p=0} p==1 {print}' file
This matches <Lexer>, sets p to 1 and skips to the next line, so the opening line itself is not printed. While p is 1, the current line is printed. When </Lexer> matches, p is set to 0 before the print rule is reached, so the closing line and everything after it is suppressed.
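A quick way to check the behaviour, as a minimal sketch with made-up file contents (plain awk is used here on the assumption that the one-liner needs no gawk-specific features):

```shell
f=$(mktemp)
# Hypothetical input: two lines inside the <Lexer> block, one outside on each side.
printf '%s\n' 'before' '<Lexer>' 'rule1' 'rule2' '</Lexer>' 'after' > "$f"

# Only the lines strictly between the two tags are printed.
inner=$(awk '/<Lexer>/{p=1; next} /<\/Lexer>/{p=0} p==1 {print}' "$f")
echo "$inner"
rm -f "$f"
```

With this input, only rule1 and rule2 are printed.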