Delete lines by multiple patterns in specific range of lines - awk

I have the following (simplified) file:
<RESULTS>
<ROW>
<COLUMN NAME="TITLE">title 1</COLUMN>
<COLUMN NAME="VERSION">1,3</COLUMN>
</ROW>
<ROW>
<COLUMN NAME="TITLE">title 1</COLUMN>
<COLUMN NAME="VERSION">1,1</COLUMN>
</ROW>
<ROW>
<COLUMN NAME="TITLE">title 1</COLUMN>
<COLUMN NAME="VERSION">1,2</COLUMN>
</ROW>
</RESULTS>
What I am trying to achieve is to delete all ROW elements that match on the title, but do not match on the latest VERSION (in this case 1,3).
So, what I have in mind is something like the following with sed:
sed -i '/<ROW>/,/<\/ROW>/<COLUMN NAME=\"TITLE\">title 1.*<COLUMN NAME=\"VERSION\">^1,3<\/COLUMN>/d' file
The expected output should be the following:
<RESULTS>
<ROW>
<COLUMN NAME="TITLE">title 1</COLUMN>
<COLUMN NAME="VERSION">1,3</COLUMN>
</ROW>
</RESULTS>
Unfortunately, this did not work, neither did anything that I tried. I searched a lot for similar issues, but nothing worked for me.
Is there a way of achieving it with any Linux command line utility (sed, awk, etc)?
Thanks a lot in advance.

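The range /<ROW>/,/<\/ROW>/ followed by another pattern won't work: an address range only selects which lines a command applies to, and sed still processes those lines one at a time, so a single regex can never see the TITLE and VERSION lines together.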
You'll have to use one of the advanced features of sed. The simplest is probably the hold space.
This:
sed -n '/<ROW>/{h;d;};H'
will store an entire ROW block in the hold space, and overwrite it when it encounters a new ROW block. (And print nothing.)
This:
sed -n '/<ROW>/{h;d;};H;/<\/ROW>/{g;p;}'
will store the entire ROW block, then print it out when it is complete.
This:
sed -n '/<ROW>/{h;d;};H;/<\/ROW>/{g;/title 1/!d;p;}'
will do the same, but will delete a block that does not contain "title 1".
This:
sed -n '/<ROW>/{h;d;};H;/<\/ROW>/{g;/title 1/!d;/1,3/p;}'
will do the same, but print only if the block contains "1,3". (You can spell out the matching lines more explicitly; I'm trying to keep this code concise.)

This might work for you (GNU sed):
sed '/<ROW>/{:a;N;/<\/ROW>/!ba;/TITLE.*title 1/!b;/VERSION.*1,3/b;d}' file
Gather up lines between <ROW> and </ROW>.
If the lines collected don't contain the correct title, bail out.
If the lines collected do contain the correct version, bail out as well (the block is printed unchanged).
Otherwise delete the lines collected.
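Since the question also mentions awk, here is a minimal awk sketch of the same idea: buffer each ROW block, then decide whether to print it once the closing tag arrives. The function name keep_latest and its three arguments are made up for illustration, and it assumes each tag sits on its own line as in the sample. Lines outside a ROW block (such as <RESULTS>) pass through untouched.

```shell
# Hypothetical helper: keep_latest FILE TITLE VERSION
# Drops every <ROW>...</ROW> block whose TITLE equals TITLE
# unless its VERSION equals VERSION; all other lines are printed unchanged.
keep_latest() {
    awk -v title="$2" -v version="$3" '
        /<ROW>/ { inrow = 1; block = "" }        # start buffering a new block
        inrow {
            block = block $0 "\n"                # collect the current line
            if ($0 ~ /<\/ROW>/) {                # block is complete, decide now
                t = "<COLUMN NAME=\"TITLE\">" title "</COLUMN>"
                v = "<COLUMN NAME=\"VERSION\">" version "</COLUMN>"
                # keep blocks with another title, or with the wanted version
                if (index(block, t) == 0 || index(block, v) > 0)
                    printf "%s", block
                inrow = 0
            }
            next
        }
        { print }                                # lines outside any ROW block
    ' "$1"
}
```

Called as keep_latest file 'title 1' '1,3', it writes the filtered document to standard output. index() is used instead of a regex so that characters in the title or version are compared literally.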

Ensuring that opening data before awk condition is dealt with

I have XML which contains:
</body></text></xml>
<?xml version="1.0" encoding="utf-8"><?xml-stylesheet type="text/xsl" href="stylesheetv1_1.xsl" ?><text><body>
I need to split the file at each XML declaration.
I've been trying the following awk line, but it fails, and I don't know why. Any help gratefully received.
awk '/<?xml v/{filename=NR".xml"}; {print >filename}' sourcefile.xml
where sourcefile.xml contains the data to be split.
I thought it might be an issue with escaping the question mark, but that seems not to be the issue.
The xml tag is preceded by \r\n
I'm using Gitbash for Windows.
What I need to end up with are a load of separate files, all of which end with
</body></text></xml>
and begin with
<?xml version="1.0" etc
The shell responds with 'expression for `>' redirection has null string value' but I'm afraid I'm not sure what that means. I also get no output files at all.
The error you are getting means that your redirection out to a file is pointing to a filename that is undefined. Your filename variable is blank at some point during script execution.
Try setting that filename variable in the BEGIN block of the awk script to ensure that the records occurring before your first "<?xml v" match have somewhere to go:
awk 'BEGIN{filename="prexmlgarbage.xml"} /<\?xml v/{filename=NR".xml"}; {print >filename}' sourcefile.xml
I've also added an escape character before the question mark so that you properly match the literal string <?xml v (unescaped, the ? would make the preceding < optional).
You could also put a condition before your print block, if you don't want to capture records before your first "<?xml v" hit:
awk '/<\?xml v/{filename=NR".xml"}; filename!=""{print >filename}' sourcefile.xml
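If the source holds many documents, it may also be worth closing each output file before opening the next, since some awk implementations limit how many files can be open at once. A minimal sketch, assuming each XML declaration starts a line; the function name split_xml and the PREFIX argument are hypothetical, added only so the output location can be chosen:

```shell
# Hypothetical helper: split_xml FILE PREFIX
# Starts a new output file named PREFIX<line number>.xml at each XML declaration.
split_xml() {
    awk -v prefix="$2" '
        /<\?xml v/ {
            if (out != "") close(out)     # release the previous output file
            out = prefix NR ".xml"        # name the new file after the line number
        }
        out != "" { print > out }         # skip records before the first declaration
    ' "$1"
}
```

For example, split_xml sourcefile.xml part_ would write part_<N>.xml for every line number N on which a declaration is found.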

Sed script needed to insert LF before each time match in a large single string

I have a lengthy string, and I need to put a line feed before each instance of a time stamp.
03:38:11,03/07/2017,node,cpu,user,sys,idle,intr/s,ctxt/s,0,0,0,9,91,0,1,0,24,75,0,total,0,17,83,2370,3574,1,0,3,4,
93,1,1,10,4,86,1,total,7,4,89,2922,4653,03:39:11,03/07/2017,node,cpu,user,sys,idle,intr/s,ctxt/s,0,0,4,25,71,0,1,5
,16,79,0,total,4,21,75,2487,3876,1,0,0,3,97,1,1,1,1,98,1,total,1,2,98,2880,4728,03:40:11,03/07/2017,node,cpu,user,
sys,idle,intr/s,ctxt/s,0,0,1,30,69,0,1,1,30,69,0,total,1,30,69,3237,4344,1,0,3,49,47,1,1,10,47,43,1,total,6,48,45,
3920,5702,
I need to see about formatting it as such:
03:38:11,03/07/2017,node,cpu,user,sys,idle,intr/s,ctxt/s,0,0,0,9,91,0,1,0,24,75,0,total,0,17,83,2370,3574,1,0,3,4,93,1,1,10,4,86,1,total,7,4,89,2922,4653,
03:39:11,03/07/2017,node,cpu,user,sys,idle,intr/s,ctxt/s,0,0,4,25,71,0,1,5,16,79,0,total,4,21,75,2487,3876,1,0,0,3,97,1,1,1,1,98,1,total,1,2,98,2880,4728,
03:40:11,03/07/2017,node,cpu,user,sys,idle,intr/s,ctxt/s,0,0,1,30,69,0,1,1,30,69,0,total,1,30,69,3237,4344,1,0,3,49,47,1,1,10,47,43,1,total,6,48,45,3920,5702,
I am currently trying to use the following:
sed -e 's/^[[:digit:]][[:digit:]]\:[[:digit:]][[:digit:]]/\n&/g' cpu.log
The ^ line anchor forces sed to match only a time stamp at the very start of the line, i.e. only the first one. Remove it and you should be fine.
To avoid inserting a newline before the first time stamp, maybe massage the script to require something before the match (hard-coding a comma would seem to work, based on your sample data); or just post-process the output to remove the first newline.
sed 's/[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/\n&/g'
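Note that \n in the replacement is not guaranteed to work outside GNU sed. A minimal awk sketch of the same substitution, which also removes the unwanted leading newline; the function name split_stamps is made up, and it assumes the log is a single line in which only time stamps have the form HH:MM:SS followed by a comma:

```shell
# Hypothetical helper: split_stamps FILE
# Puts a line feed before every HH:MM:SS, time stamp except the first.
split_stamps() {
    awk '{
        gsub(/[0-9][0-9]:[0-9][0-9]:[0-9][0-9],/, "\n&")   # newline before each stamp
        sub(/^\n/, "")                                      # drop the one before the first
        print
    }' "$1"
}
```

Run as split_stamps cpu.log; each output line then starts with a time stamp and ends with the trailing comma of its record.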

Extract data between two tags

After searching and reading extensively, I managed to get half of the work done.
Here is the string:
<td class='bold vmiddle'> Owner CIDR: </td><td><span class='jtruncate-text'>42.224.0.0/12</span></td>
I need to extract the 42.224.0.0 and /12 to make a 42.224.0.0/12.
Now I managed to get 42.224.0.0 by using:
sed -n 's/^.*<a.href="[^"]*">\([^<]*\).*/\1/p'
but I'm at a loss how to extract /12.
Can anyone help?
You were pretty close:
sed -n 's/^.*<a.href="[^"]*">\([^<]*\)<\/a>\([^<]*\).*/\1\2/p' file
All that was needed was a 2nd capture group: <\/a> after the 1st one matches the closing tag for <a>, and the 2nd capture group, \([^<]*\), then captures everything up to but not including the closing </span> tag.
\1\2 in the replacement string simply concatenates what the two capture groups matched, yielding 42.224.0.0/12 with the sample input.
You can try below awk solution -
vipin#kali:~$ awk -F'>|<' '{print $(NF-6),$(NF-4)}' OFS="" kk.txt
42.224.0.0/12
This uses multiple field separators (> and <).
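If the line looks exactly like the sample shown in the question, with the whole CIDR inside a single span, a simpler sed sketch may be enough. The function name owner_cidr is hypothetical, and the pattern assumes the class attribute is written as in the sample:

```shell
# Hypothetical helper: owner_cidr FILE
# Prints the text inside <span class='jtruncate-text'>...</span> on matching lines.
owner_cidr() {
    sed -n "s/.*<span class='jtruncate-text'>\([^<]*\)<\/span>.*/\1/p" "$1"
}
```

The sed script is double-quoted here only because the pattern itself contains single quotes.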

How do you add a numbered or bulleted list in a table cell entry or line breaks

In DocBook 5, I am trying to add a simplelist or single lines of text in a table row cell entry, but I cannot find the right combination of allowed elements. Is it possible?
This is valid according to my editor:
<informaltable>
<tgroup cols="1">
<tbody>
<row>
<entry>
<para>A single line of text</para>
<simplelist>
<member>Apiary</member>
<member>Beekeeper</member>
</simplelist>
</entry>
</row>
</tbody>
</tgroup>
</informaltable>

Oracle SQL*Loader getting CDATA values

Does anybody know how to do this? I know there's a better way of loading XML data into Oracle without using SQL*Loader, but I'm just curious how this is done with it. I already have code that can load XML data into the DB; however, it won't run if the XML file has values that contain CDATA...
Below is the control file code which works if the values are not CDATA...
LOAD DATA
INFILE FRATS.xml "str '</ROW>'"
APPEND
INTO TABLE "FRATERNITIES"
(
DUMMY FILLER TERMINATED BY "<ROW>",
THE_CODE SEQUENCE (MAX, 1),
DUMMY2 FILLER TERMINATED BY "</COLUMN>",
STORE_NN_KJ ENCLOSED BY '<COLUMN NAME="THE_NAME">' AND '</COLUMN>',
STAFF_COUNT ENCLOSED BY '<COLUMN NAME="THE_COUNT">' AND '</COLUMN>'
)
Here's the XML file:
<?xml version='1.0' encoding='MS932' ?>
<RESULTS>
<ROW>
<COLUMN NAME="THE_CODE">777</COLUMN>
<COLUMN NAME="THE_NAME">CharlieOscarDelta</COLUMN>
<COLUMN NAME="THE_COUNT">24</COLUMN>
</ROW>
</RESULTS>
Here's the XML file with CDATA values. My control file will not run with it...:
<?xml version='1.0' encoding='MS932' ?>
<RESULTS>
<ROW>
<COLUMN NAME="THE_CODE"><![CDATA[777]]></COLUMN>
<COLUMN NAME="THE_NAME"><![CDATA[CharlieOscarDelta]]></COLUMN>
<COLUMN NAME="THE_COUNT"><![CDATA[24]]></COLUMN>
</ROW>
</RESULTS>
Have you tried:
STORE_NN_KJ "substr(substr(:STORE_NN_KJ,instr(:STORE_NN_KJ,'<![CDATA[')+9),0,instr(substr(:STORE_NN_KJ,instr(:STORE_NN_KJ,'<![CDATA[')+9),']]>')-1)" ENCLOSED BY '<COLUMN NAME="THE_NAME">' AND '</COLUMN>'
EDIT
Looks like I forgot a ).. Try this..