Awk insert text from a file before 2 specific lines - awk

I have 2 files, file1 and file2.
# cat /tmp/file1
***** insert new text ****
# cat /tmp/file2
</table>
some text
</table>
<table name="test" description="test line">
some text
I want to insert the text from file1 into file2 but only before the following 2 lines:
</table>
<table name="test" description="test line">
So the end result is:
</table>
some text
***** insert new text ****
</table>
<table name="test" description="test line">
some text
Here is the awk statement I am trying, but the problem is that awk inserts the new text at every match.
# f1="$(</tmp/file1)"
# awk -vf1="$f1" '/<\/table>/,/<table name="test" description="test line">/{print f1;print;next}1' /tmp/file2
***** insert new text ****
</table>
***** insert new text ****
some text
***** insert new text ****
</table>
***** insert new text ****
<table name="test" description="test line">
some text
How do I fix the awk statement to only insert the text from file1 before those specific 2 lines? Thanks in advance.

That works for me:
awk '{if(p=="</table>"&&$0=="<table name=\"test\" description=\"test line\">")
{system("cat file1");}if(p){print p}; p=$0}END{print $0}' file2
The if statement checks whether the current line is the <table...> line and the previous line is </table>. If so, the contents of file1 are printed first; each line of file2 is printed with a one-line delay (buffered in p), so the inserted text lands just before the buffered </table>.
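If you would rather not spawn a shell with system("cat file1") on every match, a minimal equivalent sketch (an alternative, not the answer above) lets awk read /tmp/file1 itself and buffer one line of file2:
awk '
NR==FNR { ins = ins $0 ORS; next }   # slurp file1 into a variable
FNR>1 {
    if (prev == "</table>" && $0 == "<table name=\"test\" description=\"test line\">")
        printf "%s", ins             # insert file1 just before the matched pair
    print prev                       # print file2 one line behind
}
{ prev = $0 }
END { print prev }                   # flush the last buffered line
' /tmp/file1 /tmp/file2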

Related

Replace only alphanumeric chars from strings in one file in another

I have file1 with records that I want to find and replace with # in file2 and redirect the output to file3. I want to translate only the alphanumeric characters in file2. With the below code I'm not able to get the expected output. What am I doing wrong?
file_read=`cat file2`
while read line; do
var=`echo $line | tr '[a-zA-Z0-9]' '#'`
rep=`echo $file_read | awk "{gsub(/$line/,\"$var\"); print}"`
done < file1
echo file2 > file3
cat file1
2001009
#vanti Finserv Co.
2001009
Fund #1
11:11 - Capital
MS&CO(NY)
American Friends Org, Inc. 12X32
Domain-Name (LLC)
MS&CO(NY)
MS&CO(NY)
Ivy/Estate Rd
E*Trade wholesale
cat file2
<html>
<body>
<hr><br><>span class="table">Records</span><table>
<tr class="column">
<td>Rec1</td>
<td>Rec2</td>
<td>Rec3</td>
<td>Rec4</td>
<td>Rec5</td>
<td>Rec6</td>
<td>Rec7</td>
<td>Rec8</td>
</tr>
<tr class="data">
<td>#vanti Finserv Co.</td>
<td>11:11 - Capital</td>
<td>MS&CO(NY)</td>
<td>New York</td>
<td>CDX98XSD</td>
<td>E*Trade wholesale</td>
<td>Domain-Name (LLC)</td>
<td>Ivy/Estate Rd</td>
<td></td>
</tr>
<tr class="data">
<td>#vanti Finserv Co.</td>
<td></td>
<td>MS&CO(NY)</td>
<td>2</td>
<td>2</td>
<td>MS&CO(NY)</td>
<td>MS&CO(NY)</td>
<td>Ivy/Estate Rd</td>
</table>
</body>
</html>
expected output
cat file3
<html>
<body>
<hr><br><>span class="table">Records</span><table>
<tr class="column">
<td>Rec1</td>
<td>Rec2</td>
<td>Rec3</td>
<td>Rec4</td>
<td>Rec5</td>
<td>Rec6</td>
<td>Rec7</td>
<td>Rec8</td>
</tr>
<tr class="data">
<td>###### ####### ##.</td>
<td>##:## - #######</td>
<td>##&##(##)</td>
<td>New York</td>
<td>CDX98XSD</td>
<td>#*##### ########</td>
<td>######-#### (###)</td>
<td>###/###### ##</td>
<td></td>
</tr>
<tr class="data">
<td>###### ####### ##.</td>
<td></td>
<td>##&##(##)</td>
<td>2</td>
<td>2</td>
<td>##&##(##)</td>
<td>##&##(##)</td>
<td>###/###### ##</td>
</table>
</body>
</html>
Would you please try the following:
awk '
NR==FNR {s = $0; gsub("[[:alnum:]]", "#"); a[s] = $0; next}
{
if (match($0, ">[^<]+")) {
str = substr($0, RSTART+1, RLENGTH-1)
if (str in a) {
$0 = substr($0, 1, RSTART) a[str] substr($0, RSTART+RLENGTH)
}
}
}
1 ' file1 file2 > file3
It assumes the strings to be replaced are enclosed in tags, which holds for the shown example: a cell such as <td>MS&CO(NY)</td> becomes <td>##&##(##)</td>.
You seem to be looking for something like
awk 'NR==FNR {
regex = $0;
gsub(/[][(){}|\\*+?.^$]/, "\\\\&", regex);
a[++n] = regex;
gsub(/[A-Za-z0-9]/, "#");
gsub(/&/, "\\\\&");
b[n] = $0;
next
}
{ for(i=1;i<=n;++i)
gsub(a[i], b[i])
} 1' file1 file2 >file3
In brief, we populate the array a with the phrases from file1, and b with the corresponding replacement strings. The condition NR==FNR is true only while the first file is being read; for the second file we fall through to the rest of the script, which simply replaces any string from a with the corresponding string from b and prints every line.
The code is complicated somewhat by the escaping of regex metacharacters in a and further by the fact that & in the replacement string needs to be escaped, too (& alone recalls the matched text).
Demo: https://ideone.com/YkAkAZ
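To see what the escaping does, here is a tiny sketch applying the same two transformations to one phrase from file1 (MS&CO(NY)); a[i] gets the escaped regex, b[i] gets the masked replacement:
echo 'MS&CO(NY)' | awk '{
    regex = $0
    gsub(/[][(){}|\\*+?.^$]/, "\\\\&", regex)   # escape regex metacharacters -> MS&CO\(NY\)
    repl = $0
    gsub(/[A-Za-z0-9]/, "#", repl)              # mask alphanumerics -> ##&##(##)
    gsub(/&/, "\\\\&", repl)                    # escape & in the replacement -> ##\&##(##)
    print regex; print repl
}'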
You generally want to avoid while read loops in the shell; Awk is much faster and more idiomatic when you want to perform some transformation on all lines in a file.
As a further aside, please try http://shellcheck.net/ before asking for human assistance. Even after you fixed syntax errors pointed out in comments, your attempt contains common beginner errors such as broken quoting.

remove above/below line and append to a file

I have a file with the following lines. I can filter a specific word and display the lines below/above it. However, I also want to remove it from the original file and append it to a new file.
<tr>
<td>tree</td><td>apple</td><td>red</td>
</tr>
<tr>
<td>tree</td><td>apple</td><td>green</td>
</tr>
<tr>
<td>tree</td><td>apple</td><td>red</td>
</tr>
<tr>
<td>tree</td><td>apple</td><td>red</td>
</tr>
I can do this with: grep -i green origfile -A1 -B1 >> newfile, but how can I remove it from the original file?
origfile:
<tr>
<td>tree</td><td>apple</td><td>red</td>
</tr>
<tr>
<td>tree</td><td>apple</td><td>red</td>
</tr>
<tr>
<td>tree</td><td>apple</td><td>red</td>
</tr>
newfile:
<tr>
<td>tree</td><td>apple</td><td>green</td>
</tr>
Is there a cleaner/quicker way to do it?
You could do it with a single awk, segregating records into different files. It looks for the word green, writes that line together with the lines immediately above and below it into the new file, and removes them from the original file.
awk '
FNR==NR{
if($0~/green/){
words[FNR]
}
next
}
((FNR+1) in words) || (FNR in words) || ((FNR-1) in words){
print > "newfile"
next
}
1
' Input_file Input_file > temp && mv temp Input_file
Explanation: a detailed explanation of the above code.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first time Input_file is being read.
if($0~/green/){ ##Checking condition if line contains green string then do following.
words[FNR] ##Creating array of words with index of current line number.
}
next ##next will skip all further statements from here.
}
((FNR+1) in words) || (FNR in words) || ((FNR-1) in words){
##Checking condition if current line+1 OR current line OR current line-1 numbers are in words array then do following.
print > "newfile" ##Printing current line into newfile output file.
next ##next will skip all further statements from here.
}
1 ##Printing current line here.
' Input_file Input_file > temp && mv temp Input_file
##Mentioning Input_file(s) and doing inplace save into it.
$ cat tst.awk
$0 == "<tr>" { inRow=1; row=$0; next }
inRow {
row = row ORS $0
if ( $0 == "</tr>" ) {
inRow = 0
if ( index(row,"<td>green</td>") ) {
print row | "cat>&2"
next
}
else {
$0 = row
}
}
}
!inRow
$ awk -f tst.awk file >o1 2>o2
$ head o?
==> o1 <==
<tr>
<td>tree</td><td>apple</td><td>red</td>
</tr>
<tr>
<td>tree</td><td>apple</td><td>red</td>
</tr>
<tr>
<td>tree</td><td>apple</td><td>red</td>
</tr>
==> o2 <==
<tr>
<td>tree</td><td>apple</td><td>green</td>
</tr>
To modify the original file:
$ awk -f tst.awk file >o1 2>o2 && mv o1 file
$ cat file
<tr>
<td>tree</td><td>apple</td><td>red</td>
</tr>
<tr>
<td>tree</td><td>apple</td><td>red</td>
</tr>
<tr>
<td>tree</td><td>apple</td><td>red</td>
</tr>
Here is an ed solution.
#!/usr/bin/env bash
ed -s origfile.txt <<-EOF
/<td>green<\/td>/;?^<tr>?;/^<\/tr>/w newfile.txt
.;/^<\/tr>/d
w
q
EOF
Or as a separate ed script; save it as script.ed:
/<td>green<\/td>/;?^<tr>?;/^<\/tr>/w newfile.txt
.;/^<\/tr>/d
w
q
Then
ed -s origfile.txt < script.ed

How to get only first occurrence in log file using awk

I've a log file like this:
some text line
other text line
<a>
<b>1</b>
<c>2</c>
</a>
another text line
<a>
<b>1</b>
<c>2</c>
</a>
yet another text line
I need to get only the first occurrence of the XML "a":
<a>
<b>1</b>
<c>2</c>
</a>
I know
awk '/<a>/,/<\/a>/' file.log
will find all occurrences; how can I get just the first? (Adding |head -n1 obviously doesn't work because it captures only the first line, and I can't know in advance how long "a" is: the awk expression must stay generic because I have different log files with different "a" contents.)
Another slight variation is to use a simple flag variable to indicate when you are inside the first <a>...</a> block, outputting that block and then exiting. Here n is used as the flag, e.g.
awk -v n=0 '$1=="</a>" {print $1; exit} $1=="<a>" {n=1}; n==1' f.xml
Example Use/Output
With your input file as f.xml you would get:
$ awk -v n=0 '$1=="</a>" {print $1; exit} $1=="<a>" {n=1}; n==1' f.xml
<a>
<b>1</b>
<c>2</c>
</a>
(note: the bare n==1 pattern relies on awk's default action, print, to output each record while inside the block)
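For clarity, here is the same one-liner with the default actions written out explicitly (equivalent logic, just more verbose):
awk -v n=0 '
$1 == "</a>" { print $1; exit }   # closing tag: print it, then stop reading
$1 == "<a>"  { n = 1 }            # opening tag: start printing from this record on
n == 1       { print }            # inside the first block: print each record
' f.xml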
This awk:
awk '
match($0,/<a>/) {
$0=substr($0,RSTART)
flag=1
}
match($0,/<\/a/) {
$0=substr($0,1,RSTART+RLENGTH)
print
exit
}
flag' file
can handle all of these forms:
<a><b>1</b><c>2</c></a>
and this:
<a>
<b>1</b>
<c>2</c>
</a>
and also <a>
<b>1</b>
<c>2</c>
</a> this
the end
Another for GNU awk:
$ gawk -v RS="</?a>" '
NR==1 { printf RT }
NR==2 { print $0 RT }
' file
First:
$ awk '/<a>/{f=1} f; /<\/a>/{exit}' file
<a>
<b>1</b>
<c>2</c>
</a>
Last:
$ tac file | awk '/<\/a>/{f=1} f; /<a>/{exit}' | tac
<a>
<b>1</b>
<c>2</c>
</a>
Nth:
$ awk -v n=2 '/<a>/{c++} c==n{print; if (/<\/a>/) exit}' file
<a>
<b>1</b>
<c>2</c>
</a>

Bash, awk, get specific string from file

Issue
I am extracting data from a file with awk, specifically the string in quotes from <a href="DATA">.
Source file.
...
<!-- Page 18 -->
<p style="position:absolute;top:956px;left:485px;white-space:nowrap" class="ft1829">145041</p>
<p style="position:absolute;top:586px;left:246px;white-space:nowrap" class="ft1829">145042</p>
<p style="position:absolute;top:156px;left:446px;white-space:nowrap" class="ft1829">440332</p>
<!-- Page 19 -->
<p style="position:absolute;top:1205px;left:53px;white-space:nowrap" class="ft1938"><b>1 790,- </b>| 457710</p>
<p style="position:absolute;top:1205px;left:634px;white-space:nowrap" class="ft1938"><b>2 290,- </b>| 464429</p>
<p style="position:absolute;top:924px;left:353px;white-space:nowrap" class="ft1938"><b>2 590,- </b>| 464430</p>
...
Command (with help on this forum).
awk '/Page/ {h=$3} /-- Page 1 --/ {h="Title"} /href=/ && h {split($0,a,"\"");print h","a[6]}'
Results.
...
18,145041
18,145042
18,440332
19,457710
19,464429
...
The problem is that when links are on the same line, only the data for the first link is processed.
Example.
457710</p> | 464429</p>
Output.
...
18,457710,
...
Expected output.
...
18,457710,
18,464429,
...
What is wrong with the awk command?
Thanks for any ideas.
Update 1
I need to take all hrefs from this input.
<!-- Page 1 -->
<p style="position:absolute;top:397px;left:23px;white-space:nowrap" class="ft116">237002 | 237003</p>
<p style="position:absolute;top:831px;left:666px;white-space:nowrap" class="ft124">230041</p>
<p style="position:absolute;top:855px;left:447px;white-space:nowrap" class="ft116">467173</p>
<p style="position:absolute;top:910px;left:36px;white-space:nowrap" class="ft116">Hmotnost: 6 kg | 464431</p>
<!-- Page 2 -->
<p style="position:absolute;top:1176px;left:561px;white-space:nowrap" class="ft216">318417</p>
<p style="position:absolute;top:963px;left:561px;white-space:nowrap" class="ft216">338701</p>
...
Command.
awk 'match($0,/class=\"[a-zA-Z]+[0-9]+/){val=substr($0,RSTART,RLENGTH);sub(/[^0-9]*/,"",val)} match($0,/<a href=\"[0-9]+/){val1=substr($0,RSTART,RLENGTH);sub(/[^"]*\"/,"",val1);print substr(val,1,2)","val1}' test.html
Output.
11,237002
12,230041
11,467173
11,464431
21,318417
...
But I need this (for example, 1,237003 is not present in the result above, and the first-column page number is different).
1,237002
1,237003
1,230041
1,467173
1,464431
2,318417
...
Thanks.
As the awk command will only process the first hyperlink on each line, just edit the file first to suit the awk command:
sed 's/\(a href=\)/\n\1/g' data-file | awk '/page/ ....'
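For illustration, here is that sed applied to a hypothetical two-link line (the pasted source above no longer shows the <a href="..."> tags the awk matches, so the exact markup below is an assumption; note that \n in the replacement requires GNU sed):
$ echo '<b>1 790,- </b><a href="457710">457710</a> | <a href="464429">464429</a></p>' | sed 's/\(a href=\)/\n\1/g'
<b>1 790,- </b><
a href="457710">457710</a> | <
a href="464429">464429</a></p>
Each a href= now starts its own line, so awk only ever sees one link per line (the quote-splitting indices in the awk part may still need a small adjustment for the continuation lines).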
Tested with the given example; could you please try the following:
awk '
{
gsub("</p> | ","&\n")
$1=$1
}
match($0,/class=\"[a-zA-Z]+[0-9]+/){
val=substr($0,RSTART,RLENGTH)
sub(/[^0-9]*/,"",val)
}
match($0,/<a href=\"[0-9]+/){
val1=substr($0,RSTART,RLENGTH)
sub(/[^"]*\"/,"",val1)
print substr(val,1,2)","val1
val=val1=""
}
' Input_file

Reading file in a pattern using awk

I have an input file in the following manner:
<td> Name1 </td>
<td> <span class="test">Link </span></td>
<td> Name2 </td>
<td> <span class="test">Link </span></td>
I want an awk script to read this file and produce output in the following manner:
url1 Name1
url2 Name2
Can anyone help me out with this trivial-looking problem? Thanks.
Extracting one href per line is relatively simple, so long as the input conforms to XHTML standards, there is at most one href on a line, and you don't care about enclosing tags; but perl is easier:
$ perl -ne 'print "$1\n" if /href="([^"]+)"/'
If you care about enclosing tags or they are not standard conformant, you cannot use regular expressions to parse HTML. It is impossible.
Added: oops, you do care about context; forget about regexps and use a real HTML parser.
Here is an awk script that does the job
awk '
/a href=\".*\"/ { sub( /^.*a href=\"/,"" ); sub(/\".*/,""); print $0, name }
{ name = $2 }
'
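Since the pasted input has lost its <a href="..."> tags, here is a quick check against a reconstructed version of the input (the href markup below is an assumption):
$ cat infile
<td> Name1 </td>
<td> <a href="url1"><span class="test">Link </span></a></td>
<td> Name2 </td>
<td> <a href="url2"><span class="test">Link </span></a></td>
$ awk '
/a href=\".*\"/ { sub( /^.*a href=\"/,"" ); sub(/\".*/,""); print $0, name }
{ name = $2 }
' infile
url1 Name1
url2 Name2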
this might work:
awk 'BEGIN{i=1}
{line[i++]=$0}
END{
  j=1
  while (j<i)
    {print line[j+1] line[j]; j+=2}
}' yourfile | awk '{print substr($4,7,length($4)-6),$6}'
gawk '/^<td>/ {n = $2; getline; print gensub(/.*href="([^"]*).*/,"\\1",1), n}' infile
url1 Name1
url2 Name2
awk 'BEGIN{RS="></td>\n"; FS="> | </|\""}{print $7, $2}' infile
Every 2 lines are treated as a record.