sed multiline match combined with delete between two lines? - awk

Can I use sed's multiline match logic to delete all the lines between a range of lines?
I have been trying all sorts of combinations of expressions with no luck.
Here is an example file ('creatures.txt'):
START TAG
species: swallowtail butterfly flying
legs
wings
head
END TAG
START TAG
species: common lizard running
legs
tail
head
END TAG
START TAG
species: peacock butterfly resting
legs
wings
head
END TAG
START TAG
species: blackbird flying
legs
wings
head
END TAG
I want to perform the following:
Whenever I encounter the following multiline match: (i) 'START' contained somewhere in the first line and (ii) 'butterfly' contained somewhere in the second line,
then delete all the lines between the 'START TAG' and 'END TAG' lines.
So using the example file above, the resulting output would be:
START TAG
END TAG
START TAG
species: common lizard running
legs
tail
head
END TAG
START TAG
END TAG
START TAG
species: blackbird flying
legs
wings
head
END TAG
Thanks,
James.

Like this:
sed '/START TAG/{N;/butterfly/{:a;/END TAG/!{N;ba};d}}' file
Explanation:
# Enter block when 'START TAG' is found
/START TAG/ {
    N              # Append next line to the pattern buffer
    # Enter block when 'butterfly' is found
    /butterfly/ {
        :a         # Create a label (could also be :foo)
        # Enter block unless 'END TAG' is found
        /END TAG/! {
            N      # Append next line to pattern buffer
            ba     # Branch back to label :a
        }
        # Once 'END TAG' is found
        d          # Delete the pattern buffer
    }
}
PS: A slightly modified version can be used to keep the START / END tags:
sed '/START/{p;N;/butterfly/{:a;/END/!{N;ba};s/.*\n//}}' file

Any time you find yourself saying "sed multiline" you are looking at the wrong tool. sed is for doing s/old/new/ on individual lines, that is all. For anything else you should be using awk.
$ cat tst.awk
{ rec = (rec=="" ? "" : rec ORS) $0 }
/END/ {
    numLines = split(rec,lines,ORS)
    print (lines[2] ~ /butterfly/ ? lines[1] ORS lines[numLines] : rec)
    rec = ""
}
$ awk -f tst.awk file
START TAG
END TAG
START TAG
species: common lizard running
legs
tail
head
END TAG
START TAG
END TAG
START TAG
species: blackbird flying
legs
wings
head
END TAG
The above will work using any awk in any shell on any UNIX box; it is clear, simple, robust, and easily modified if/when your requirements change later to look at or print any of the other lines in each record, or any combination of values in each line or across the whole record. For example, to generate a CSV of just the records where the creature has wings:
$ cat tst.awk
{ rec = (rec=="" ? "" : rec ORS) $0 }
/END/ {
    numLines = split(rec,lines,ORS)
    if ( lines[4] == "wings" ) {
        sub(/species: /,"",lines[2])
        for (i=2; i<numLines; i++) {
            printf "%s%s", lines[i], (i<(numLines-1) ? "," : ORS)
        }
    }
    rec = ""
}
$ awk -f tst.awk file
swallowtail butterfly flying,legs,wings,head
peacock butterfly resting,legs,wings,head
blackbird flying,legs,wings,head
and anything else you might want to do is equally trivial and consistent.


Awk start and end having same string

I would like to print
the contents between start and end
the contents between start and the second occurrence of end
Unfortunately both start and end can have exactly the same value, and end also shows up twice in the txt file.
Sample file -
My
Dog
Start
has
a nice
tail
end
My
Dog
name
end
is
jay
awk '/Dog/, /Dog/ {print $0}' awktest.txt -> For grabbing contents between two Dog words
awk '/Start/, /end/ {print $0}' awktest.txt -> For grabbing contents between Start and second end
Could you please try following. Written based on shown samples.
awk '
/^Start$/{
  found=1
}
found;
/^end$/ && ++count==2{
  found=""
}
' Input_file
Brief explanation: on a line matching start, set found to 1; the bare found condition then prints each line while found is set. On a line matching end, increment its count, and when the count reaches 2, reset found to null so that printing stops there.
The above will print the start and end lines too; in case you want to skip those lines, try the following.
awk '
/^Start$/{
  found=1
  next
}
/^end$/ && ++count==2{
  found=""
}
found;
' Input_file
You can do both of these with a simple state machine with an echo variable e. The first (on one line):
pax> awk '/end/ {e = 0} e == 1 {print} /Start/ {e = 1}' inputFile
has
a nice
tail
Echo starts off; then, for each line (the order is important here):
an end line will turn echo off;
a line will print if echo is on;
a Start line will turn it on.
The second is similar but echo becomes a counter rather than a flag. That way, it only turns off on the second end:
pax> awk '/end/ {e -= 1} e > 0 {print} /Start/ {e = 2}' inputFile
has
a nice
tail
end
My
Dog
name
And, in fact, you can combine them if you're happy to supply the count (use 1, 2 or any other value you may need):
pax> awk -vc=2 '/end/ {e -= 1} e > 0 {print} /Start/ {e = c}' inputFile
has
a nice
tail
end
My
Dog
name
You may need to watch out for edge cases such as what to do when Start appears within the section you're echoing. But that's just a matter of tweaking the state machine to detect that and act accordingly. At the moment, it will restart the counter. If you want it to not do that, use:
e == 0 && /Start/ {e = c}
for the final clause.
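To make that concrete, here is the guarded version run end to end (a sketch; the sample is saved to a hypothetical file awktest.txt):

```shell
# Recreate the question's sample input (the file name is assumed).
cat > awktest.txt <<'EOF'
My
Dog
Start
has
a nice
tail
end
My
Dog
name
end
is
jay
EOF

# Same counter-based state machine, but Start only re-arms the
# counter when echo is already off (e == 0).
awk -v c=2 '/end/ {e -= 1} e > 0 {print} e == 0 && /Start/ {e = c}' awktest.txt
```

With no nested Start in the sample, the output is the same as before; the guard only matters once a Start line appears inside an echoed section.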
$ awk -v start=Start -v end=end '$0~end && e++{exit} s; $0~start{s=1}' file
$ awk -v start=Dog -v end=Dog '...' file
will print between start and the second occurrence of end
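For example, against the sample above (saved as a hypothetical awktest.txt, with the same program as the first command), the Dog/Dog variant prints the contents between the two Dog lines:

```shell
# Recreate the sample input from the question (the file name is assumed).
cat > awktest.txt <<'EOF'
My
Dog
Start
has
a nice
tail
end
My
Dog
name
end
is
jay
EOF

# start=Dog, end=Dog: start printing after the first Dog,
# exit at the second (e++ is 0, hence false, on the first match).
awk -v start=Dog -v end=Dog '$0~end && e++{exit} s; $0~start{s=1}' awktest.txt
```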

Using awk to process html-related Gift-format Moodle questions

This is basically an awk question, but it is about processing data for the Moodle
Gift format, hence the tags.
I want to format html code in a question (Moodle "test" activity) but I need to replace < and > with the corresponding entities, as these will be interpreted as "real" html, and not printed.
However, I want to be able to type the question with regular code and post-process the file before importing it as gift into Moodle.
I thought awk would be the perfect tool to do this.
Say I have this (invalid as such) Moodle/gift question:
::q1::[html]This is a question about HTML:
<pre>
<p>some text</p>
</pre>
and some tag:<code><img></code>
{T}
What I want is a script that translates this into a valid gift question:
::q1::[html]This is a question about HTML:
<pre>
&lt;p&gt;some text&lt;/p&gt;
</pre>
and some tag:<code>&lt;img&gt;</code>
{T}
key point: replace < and > with &lt; and &gt; when:
inside a <pre>-</pre> block (assuming those tags are alone on a line)
between <code> and </code>, with an arbitrary string in between.
For the first part, I'm fine. I have a shell script calling awk (gawk, actually).
awk -f process_src2gift.awk $1.src >$1.gift
with process_src2gift.awk:
BEGIN { print "// THIS IS A GENERATED FILE !" }
{
    if( $1=="<pre>" )        # opening a "code" block
    {
        code=1;
        print $0;
    }
    else
    {
        if( $1=="</pre>" )   # closing a "code" block
        {
            code=0;
            print $0;
        }
        else
        {   # if "code block", replace < > by html entities
            if( code==1 )
            {
                gsub(">","\\&gt;");
                gsub("<","\\&lt;");
            }
            print $0;
        }
    }
}
END { print "// END" }
However, I'm stuck on the second requirement.
Questions:
Is it possible to extend my awk script to process the html code inside the <code> tags? Any ideas? I thought about using sed but I didn't see how to do that.
Maybe awk isn't the right tool for this? I'm open to any suggestion of another (standard Linux) tool.
Answering own question.
I found a solution by doing a two-step awk process:
first step: as described in the question
second step: define <code> or </code> as the field delimiter, using a regex, and perform the string replacement on the second field ($2).
The shell file becomes:
echo "Step 1"
awk -f process_src2gift.awk $1.src >$1.tmp
echo "Step 2"
awk -f process_src2gift_2.awk $1.tmp >$1.gift
rm $1.tmp
And the second awk file (process_src2gift_2.awk) will be:
BEGIN { FS="[<][/]?[c][o][d][e][>]"; }
{
    gsub(">","\\&gt;",$2);
    gsub("<","\\&lt;",$2);
    if( NF >= 3 )
        print $1 "<code>" $2 "</code>" $3
    else
        print $0
}
Of course, there are limitations:
no attributes in the <code> tag
only one pair <code></code> in the line
probably others...
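For what it's worth, the per-line replacement could also be done in a single awk pass with match(), avoiding the temporary file; this is just a sketch (not the answer's actual solution) under the same limitations: no attributes on the tag and at most one <code></code> pair per line.

```shell
# Escape < and > only between <code> and </code>, in one awk pass.
printf '%s\n' 'and some tag:<code><img></code>' | awk '{
    if (match($0, /<code>.*<\/code>/)) {
        pre  = substr($0, 1, RSTART + 5)             # through "<code>"
        body = substr($0, RSTART + 6, RLENGTH - 13)  # text between the tags
        post = substr($0, RSTART + RLENGTH - 7)      # "</code>" and the rest
        gsub(/</, "\\&lt;", body)                    # "\\&" is a literal ampersand
        gsub(/>/, "\\&gt;", body)
        $0 = pre body post
    }
    print
}'
```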

Sed between blocks of some exact text

I am struggling to parse a log file.
Here is how it looks:
node_name: na2-devdb-cssx
run_id: 3c3424f3-8a62-4f4c-b97a-2096a2afc070
start_time: 2015-06-26T21:00:44Z
status: failure

node_name: eu1-devsx
run_id: f5ed13a3-1f02-490f-b518-97de9649daf5
start_time: 2015-06-26T21:00:34Z
status: success
I need to get the blocks that have "failure" in the last line of the block.
Ideally it would also take the timestamp into account, e.g. only match blocks whose timestamp looks like "2015-06-26T2*".
And here what I have tried so far:
sed -e '/node_name/,/failure/p' /file
sed -n '/node_name/,/failure/p' /file
awk '/node_name/,/failure/' file
sed -e 's/node_name\(.*\)failure/\1/' file
None of them works for me.
I just get everything, rather than only the failure blocks...
For example:
[root@localhost chef-repo-zilliant]# sed -n '/node_name/,/failure/p' /tmp/run.txt | head
node_name: eu1-devdb-linc
run_id: e49fe64d-567d-4627-a10d-477e17fb6016
start_time: 2015-06-28T20:59:55Z
status: success
node_name: eu1-devjs1
run_id: c6c7f668-b912-4459-9d56-94d1e0788802
start_time: 2015-06-28T20:59:53Z
status: success
I have no idea why it doesn't work. These methods seem to work fine everywhere else...
Thank you in advance.
A way with GNU sed:
sed -n ':a;/^./{H;n;ba;};x;/2015-06-26T21/{/failure$/p;};' file.txt
details:
:a;                   # define the label "a"
/^./ {                # condition: when a line is not empty
    H;                # append it to the hold space
    n;                # load the next line into the pattern space
    ba;               # go to label "a"
};
x;                    # swap hold space and pattern space
/2015-06-26T21/ {     # condition: if the needed date is in the block
    /failure$/ p;     # condition: if "failure" is in the block then print
};
I noticed you tried awk as well, although you only tagged the question with sed, so I will add a solution using it.
You can play with the built-in variables that control how lines and fields are split, like:
awk '
    BEGIN { RS = ""; FS = OFS = "\n"; ORS = "\n\n" }
    $NF ~ /failure/ && $(NF-1) ~ /2015-06-26T2/ { print }
' infile
RS = "" makes awk read blank-line-separated records (paragraph mode). FS and OFS split each record into fields on newlines, and ORS = "\n\n" prints a blank line after each record, matching the original input.
It yields:
node_name: na2-devdb-cssx
run_id: 3c3424f3-8a62-4f4c-b97a-2096a2afc070
start_time: 2015-06-26T21:00:44Z
status: failure
Use grep.
grep -oPz '\bnode_name:(?:(?!\n\n)[\s\S])*?2015-06-26T2(?:(?!\n\n)[\s\S])*?\bfailure\b' file
The main part here is (?:(?!\n\n)[\s\S])*?, which matches any character, but not across a blank line, zero or more times.

Removing Quote From Field For Filename Using AWK

I've been playing around with this for an hour trying to work out how to embed the removal of quotes from a specific field using AWK.
Basically, the file wraps text in quotes, but I want to use the second field to name the file and split the rows based on the first field.
ID,Name,Value1,Value2,Value3
1,"AAA","DEF",1,2
1,"AAA","GGG",7,9
2,"BBB","DEF",1,2
2,"BBB","DEF",9,0
3,"CCC","AAA",1,1
What I want to get out are three files, all with the header row, named:
AAA [1].csv
BBB [2].csv
CCC [3].csv
I have got it all working, except for the fact that I can't for the life of me work out how to remove the quotes around the filename!!
So, this command does everything except that the file is named with quotes around $2; I need to do some kind of transformation on $2 before it goes into evname. In the actual file itself, I want to keep the encapsulating quotes.
awk -F, 'NR==1{h=$0;next}!($1 in files){evname=$2" ["$1"].csv";files[$1]=1;print h>evname}{print > evname}' DataExtract.csv
I've tried to push a gsub into this, but I'm struggling to work out exactly how this should look.
This is, I think, as close as I have got, but it just puts "2" where $2 should be; I'm not sure if this means I need to escape $2 somehow in the gsub, but trying that doesn't seem to work, so I'm at a loss as to what I'm doing wrong.
awk -F, 'NR==1{h=$0;next}!($1 in files){evname=gsub("\"","", $2)" - Event ID ["$1"].csv";files[$1]=1;print h>evname}{print > evname}' DataExtract.csv
Any help greatly appreciated.
Thanks in advance!!
Gannon
If I understand what you are attempting correctly, then
awk -F, 'NR==1{h=$0;next}!($1 in files){gsub(/"/, "", $2); evname=$2" ["$1"].csv";files[$1]=1;print h>evname}{print > evname}' DataExtract.csv
should work. That is
NR == 1 {
    h = $0;
    next
}
!($1 in files) {
    stub = $2             # <-- this is the new bit: make a working copy
                          #     of $2 (so that $2 is unchanged and the line
                          #     is not rebuilt with changes for printing),
    gsub(/"/, "", stub)   # remove the quotes from it, and
    evname = stub " [" $1 "].csv"   # use it to assemble the filename.
    files[$1] = 1;
    print h > evname
}
{
    print > evname
}
You can, of course, use
evname = stub " - Event ID [" $1 "].csv"
or any other format after the substitution (this one seems to be what you tried to get in your second code snippet).
The gsub function returns the number of substitutions made, not the result of the substitution; that is why evname=gsub("\"","", $2)" - Event ID ["$1"].csv" does not work.
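A one-liner makes the difference obvious (purely illustrative, not part of the original script):

```shell
# gsub() edits its target in place and returns how many substitutions it made.
echo | awk '{ s = "\"CCC\""; n = gsub(/"/, "", s); print n, s }'
```

So assigning the return value of gsub() to evname stores the count (2 here), not the de-quoted string.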
Things are always clearer with a little white space:
awk -F, '
NR==1 { hdr=$0; next }
!seen[$1]++ {
evname = $2
gsub(/"/,"",evname)
outfile = evname " [" $1 "].csv"
print hdr > outfile
}
{ print > outfile }
' DataExtract.csv
Aside: It's pretty unusual for someone to WANT to create files with spaces in their names given the complexity that introduces in any later scripts you write to process them. You sure you want to do that?
P.S. here's the gawk version as suggested by @JID below:
awk -F, '
    NR==1 { hdr=$0; next }
    !seen[$1]++ {
        outfile = gensub(/"/,"","g",$2) " [" $1 "].csv"
        print hdr > outfile
    }
    { print > outfile }
' DataExtract.csv
Apply the gsub before you make the assignment:
awk -F, 'NR==1{h=$0;next}
!($1 in files){
gsub("\"","",$2); # Add this line
evname=$2" ["$1"].csv";files[$1]=1;print...

Fields contain field separator as string: How to apply awk correctly in this case?

I have a CSV-file similar to this test.csv file:
Header 1; Header 2; Header 3
A;B;US
C;D;US
E;F;US
G;H;FR
I;J;FR
K;L;FR
M;"String with ; semicolon";UK
N;"String without semicolon";UK
O;"String OK";
P;"String OK";
Now, I want to split this file based on header 3. So I want to end up with four separate CSV files, one for "US", "FR", "UK", and "".
With my very limited Linux command line skills (sadly :-( I have so far used this line:
awk -F\; 'NR>1{ fname="country_yearly_"$3".csv"; print >>(fname); close(fname);}' test.csv
Of course, experienced command line users will notice my problem: one field in my test.csv contains rows in which the semicolon I use as a separator also appears inside fields marked with quotation marks (I can't guarantee that for sure because of the millions of rows, but I'm happy with an answer that assumes this). So, sadly, I get an additional file named country_yearly_ semicolon".csv, which contains this row in my example.
In my venture to solve this issue, I came across this question on SO. In particular, Thor's answer seems to contain the solution to my problem: replacing all semicolons inside strings. I adjusted his code accordingly as follows:
awk -F'"' -v OFS='' '
    NF > 1 {
        for(i=2; i<=NF; i+=2) {
            gsub(";", "|", $i);
            $i = FS $i FS;    # reinsert the quotes
        }
        print
    }' test.csv > test1.csv
Now, I get the following test1.csv file:
M;"String with | semicolon";UK
N;"String without semicolon";UK
O;"String OK";
P;"String OK";
As you can see, all rows that have quotation marks are shown and my problem line is fixed as well, but a) I actually want all rows, not only those with quotation marks, and I can't figure out which part of his code limits the output to rows with quotation marks, and b) I think it would be more efficient if test.csv were changed in place instead of sending the output to a new file, but I don't know how to do that either.
EDIT in response to Birei's answer:
Unfortunately, my minimal example was too simple. Here is an updated version:
Header 1; Header 2; Header 3; Header 4
A;B;US;
C;D;US;
E;F;US;
G;H;FR;
I;J;FR;
K;L;FR;
M;"String with ; semicolon";UK;"Yet another ; string"
N;"String without semicolon";UK; "No problem here"
O;"String OK";;"Fine"
P;"String OK";;"Not ; fine"
Note that my real data has roughly 100 columns and millions of rows, and the country column, ignoring semicolons in strings, is column 13. However, as far as I can see, I can't use the fact that it's column 13 unless I get rid of the semicolons in strings first.
To split the file, you might just do:
awk -v FS=";" '{ CSV_FILE = "country_yearly_" $NF ".csv" ; print > CSV_FILE }'
Which always takes the last field to construct the file name.
In your example, only lines with quotation marks are printed due to the NF > 1 pattern. The following script will print all lines:
awk -F'"' -v OFS='' '
    NF > 1 {
        for(i=2; i<=NF; i+=2) {
            gsub(";", "|", $i);
            $i = FS $i FS;    # reinsert the quotes
        }
    }
    {
        # print all lines
        print
    }' test.csv > test1.csv
To do what you want, you could change the line in the script and reprocess it:
awk -F'"' -v OFS='' '
    # Save the original line
    { ORIGINAL_LINE = LINE = $0 }

    # Replace the semicolons inside quotes by a dummy character
    # and put the resulting line in the LINE variable
    NF > 1 {
        LINE = ""
        for(i=2; i<=NF; i+=2) {
            gsub(";", "|", $i)
            LINE = LINE $(i-1) FS $i FS    # reinsert the quotes
        }
        # Add the end of the line after the last quote
        if ( $(i+1) ) { LINE = LINE $(i+1) }
    }

    {
        # Put the semicolon-separated fields in a table
        # (the semicolons inside quotes have been removed from LINE)
        split( LINE, TABLE, /;/ )
        # Build the file name -- TABLE[ 3 ] is the 3rd field
        CSV_FILE = "country_yearly_" TABLE[ 3 ] ".csv"
        # Save the line
        print ORIGINAL_LINE > CSV_FILE
    }'
You were close to a solution. I would use the last field to avoid the problem of fields with double quotes. Also, there is no need to close each file: awk will close any files that are still open when the script ends.
awk '
    BEGIN {
        FS = OFS = ";";
    }
    FNR > 1 {
        fname = "country_yearly_" $NF ".csv";
        print >>fname;
    }
' infile
Check output:
head country_yearly_*
That yields:
==> country_yearly_.csv <==
O;"String OK";
P;"String OK";
==> country_yearly_FR.csv <==
G;H;FR
I;J;FR
K;L;FR
==> country_yearly_UK.csv <==
M;"String with ; semicolon";UK
N;"String without semicolon";UK
==> country_yearly_US.csv <==
A;B;US
C;D;US
E;F;US