I have the following contents in my file file.txt
Start
1
2
3
5
end
Start
a
b
c
d
end
How do I use awk alone to get the last section, from "Start" to "end", as follows?
Start
a
b
c
d
end
What I have tried:
awk '/Start/ { f = 1; n++ } f && n == 2; /end/ { f = 0 }' file.txt
Here is a tac + awk solution; please try the following.
tac Input_file | awk '/^end/{found=1} found; /^Start/{exit}' | tac
Explanation: tac prints Input_file in reverse order (bottom to top) and passes its output to awk, which prints from the first occurrence of end up to the first occurrence of Start, exiting as soon as Start is found. That output is sent to tac again, which reverses it back into the original order of Input_file.
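As a quick self-contained check, here is the pipeline run against the sample from the question (the file is recreated inline as file.txt):

```shell
# Recreate the sample file from the question
cat > file.txt <<'EOF'
Start
1
2
3
5
end
Start
a
b
c
d
end
EOF

# Reverse, grab from the first "end" back to the first "Start", reverse again
tac file.txt | awk '/^end/{found=1} found; /^Start/{exit}' | tac
```

This prints the last Start..end block: Start, a, b, c, d, end.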
2nd solution: Using GNU awk, one could try the following. It assumes there is no discrepancy in the appearance of the Start and end keywords (i.e., each Start in your Input_file is followed by a matching end; otherwise it will give false positives).
awk -v RS= '{sub(/.*Start/,"Start")} 1' Input_file
You may use this awk:
awk '$1 == "Start" { s = ""; p = 1 }
p { s = s $0 ORS }
$1 == "end" { p = 0 }
END { printf "%s", s }' file
Start
a
b
c
d
end
Related
Is there a way in bash to print lines from one match to another, unless a third match is between those lines? Let's say the file is:
A
B
C
D
E
A
B
Z
C
D
And I want to print all the lines between "A" and "C", but not those containing "Z", so output should be:
A
B
C
I'm using this piece of code to match lines between "A" and "C":
awk '/C/{p=0} /A/{p=1} p'
With your shown samples, please try the following awk code.
awk '
/A/ { found=1 }
/C/ && !noVal && found{
print value ORS $0
found=noVal=value=""
}
found && /Z/{ noVal=1 }
found{
value=(value?value ORS:"")$0
}
' Input_file
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
/A/ { found=1 } ##Checking condition if line has A then set found to 1.
/C/ && !noVal && found{ ##Checking if C is found and noVal is NULL and found is set then do following.
print value ORS $0 ##printing value ORS and current line here.
found=noVal=value="" ##Nullifying found,noVal and value here.
}
found && /Z/{ noVal=1 } ##Checking if found is SET and Z is found then set noVal here.
found{ ##Checking if found is set here.
value=(value?value ORS:"")$0 ##Creating value which has current line in it and keep concatenating values to it.
}
' Input_file ##Mentioning Input_file name here.
This uses full-line string matching instead of the partial-line regexp matching used in your question and the other answers posted so far (see how-do-i-find-the-text-that-matches-a-pattern for the difference), as I expect that's what you should really be using for a robust solution:
$ cat tst.awk
$0 == "A" { inBlock=1 }
inBlock {
rec = rec $0 ORS
if ( $0 == "C" ) {
if ( !index(ORS rec ORS, ORS "Z" ORS) ) {
printf "%s", rec
}
rec = ""
inBlock = 0
}
}
$ awk -f tst.awk file
A
B
C
If you REALLY wanted to continue to use partial-line regexp matching that'd be this:
$ cat tst.awk
/A/ { inBlock=1 }
inBlock {
rec = rec $0 ORS
if ( /C/ ) {
if ( rec !~ /Z/ ) {
printf "%s", rec
}
rec = ""
inBlock = 0
}
}
but that's fragile if your real data isn't just single letters on their own lines.
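To see that fragility, consider a hypothetical input where a line merely contains a Z (the file name demo.txt and the line ZEBRA are made up for illustration):

```shell
printf '%s\n' A B ZEBRA C > demo.txt

# Partial-line matching: /Z/ also matches "ZEBRA", so the whole block is suppressed
partial=$(awk '/A/{inBlock=1} inBlock{rec=rec $0 ORS; if(/C/){if(rec !~ /Z/) printf "%s", rec; rec=""; inBlock=0}}' demo.txt)

# Full-line matching: index(ORS rec ORS, ORS "Z" ORS) only finds a line that is exactly "Z"
full=$(awk '$0=="A"{inBlock=1} inBlock{rec=rec $0 ORS; if($0=="C"){if(!index(ORS rec ORS, ORS "Z" ORS)) printf "%s", rec; rec=""; inBlock=0}}' demo.txt)

printf 'partial:\n%s\nfull:\n%s\n' "$partial" "$full"
```

The partial-line version prints nothing (ZEBRA falsely triggers the Z check), while the full-line version prints the whole A..C block.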
I would use A as record separator and C as field separator. So, the A to C range would be in $1 (except the A itself at the beginning and the C at the end) and the rest up to the next A in $2.
The trick is to print only the first field $1 if it doesn't contain any Z. Skip the first record that will be empty.
So try:
awk 'BEGIN{RS="A";FS="C"}(NR > 1) && !/Z/{print "A" $1 "C"}' inputfile
Or even better, according to a comment of Ed Morton below:
awk 'BEGIN{RS="A";FS="C"}(NR > 1) && !/Z/{print RS $1 FS}' inputfile
If the Z can occur after C, we will have to correct the code.
You can use sed to do what you described:
sed '/^A$/,/^C$/!d; /^Z$/d' example-data
# gives
A
B
C
A
B
C
!d means delete lines which don't match the address.
Your expected result was three lines though. So you could use sed '/^A$/,/^C$/!d; /^Z$/d; /^C$/q'. Or, sed '/^A$/,/^C$/!d; /^Z$/d' | sort -u?
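For instance, the q variant stops after the first complete block (a minimal sketch; the example-data file is recreated inline):

```shell
# Recreate the sample data from the question
printf '%s\n' A B C D E A B Z C D > example-data

# Keep only A..C ranges, drop Z lines, and quit at the first C
sed '/^A$/,/^C$/!d; /^Z$/d; /^C$/q' example-data
```

This prints exactly the three expected lines: A, B, C.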
I would like to print
contents between Start and end
contents between Start and the second occurrence of end
Unfortunately the Start and end markers are exactly the same each time they appear, and end shows up twice in the txt file
Sample file -
My
Dog
Start
has
a nice
tail
end
My
Dog
name
end
is
jay
awk '/Dog/, /Dog/ {print $0}' awktest.txt -> For grabbing contents between two Dog words
awk '/Start/, /end/ {print $0}' awktest.txt -> For grabbing contents between Start and second end
Could you please try the following, written based on your shown samples.
awk '
/^Start$/{
found=1
}
found;
/^end$/ && ++count==2{
found=""
}
' Input_file
Brief explanation: when a line is exactly Start, set found to 1; while found is set, print the line. When a line is exactly end and its occurrence count reaches 2, reset found so that printing stops there.
The above will print the Start and end lines too; in case you want to skip those lines, try the following.
awk '
/^Start$/{
found=1
next
}
/^end$/ && ++count==2{
found=""
}
found;
' Input_file
You can do both of these with a simple state machine, using an echo variable e. The first (on one line):
pax> awk '/end/ {e = 0} e == 1 {print} /Start/ {e = 1}' inputFile
has
a nice
tail
Echo starts off; then, for each line (order is important here):
an end line will turn echo off;
a line will print if echo is on;
a Start line will turn it on.
The second is similar but echo becomes a counter rather than a flag. That way, it only turns off on the second end:
pax> awk '/end/ {e -= 1} e > 0 {print} /Start/ {e = 2}'
has
a nice
tail
end
My
Dog
name
And, in fact, you can combine them if you're happy to supply the count (use 1, 2 or any other value you may need):
pax> awk -vc=2 '/end/ {e -= 1 } e > 0 {print} /Start/ {e = c}'
has
a nice
tail
end
My
Dog
name
You may need to watch out for edge cases, such as what to do when Start appears within the section you're echoing. But that's just a matter of tweaking the state machine to detect that and act accordingly. At the moment it will restart the counter; if you want it not to do that, use:
e == 0 && /Start/ {e = c}
for the final clause.
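A sketch of the difference, using a made-up input that contains a nested Start (the file name nested.txt and its contents are hypothetical):

```shell
# Hypothetical input with a Start inside the echoed section
printf '%s\n' Start a end Start b end c end > nested.txt

# Unguarded: the inner Start resets the counter, so echoing continues past the second end
awk -v c=2 '/end/ {e -= 1} e > 0 {print} /Start/ {e = c}' nested.txt
echo ---
# Guarded: the inner Start is ignored while echoing, so it stops at the second end overall
awk -v c=2 '/end/ {e -= 1} e > 0 {print} e == 0 && /Start/ {e = c}' nested.txt
```

The unguarded version prints through the third end's section; the guarded one stops after the second end.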
$ awk -v start=Start -v end=end '$0~end && e++{exit} s; $0~start{s=1}' file
$ awk -v start=Dog -v end=Dog '...' file
will print between start and the second occurrence of end
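For example, against the sample file from the question (note the matching is case-sensitive, so pass the markers exactly as they appear; sample.txt is recreated inline):

```shell
cat > sample.txt <<'EOF'
My
Dog
Start
has
a nice
tail
end
My
Dog
name
end
is
jay
EOF

# Print everything after Start, exiting on the second occurrence of end
awk -v start=Start -v end=end '$0~end && e++{exit} s; $0~start{s=1}' sample.txt
```

This prints: has, a nice, tail, end, My, Dog, name.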
awk -F, '{if ($2 == 0) awk '{ total += $3; count++ } END { print total/count }' CLN_Tapes_LON; }' /tmp/CLN_Tapes_LON
awk: {if ($2 == 0) awk {
awk: ^ syntax error
bash: count++: command not found
Just for fun, let's look at what's wrong with your original version and transform it into something that works, step by step. Here's your initial version (I'll call it version 0):
awk -F, '{if ($2 == 0) awk '{ total += $3; count++ } END { print total/count }' CLN_Tapes_LON; }' /tmp/CLN_Tapes_LON
The -F, sets the field separator to be the comma character, but your later comment seems to indicate that the columns (fields) are separated by spaces. So let's get rid of it; whitespace-separation is what awk expects by default. Version 1:
awk '{if ($2 == 0) awk '{ total += $3; count++ } END { print total/count }' CLN_Tapes_LON; }' /tmp/CLN_Tapes_LON
You seem to be attempting to nest a call to awk inside your awk program? There's almost never any call for that, and this wouldn't be the way to do it anyway. Let's also get rid of the mismatched quotes while we're at it: note in passing that you cannot nest single quotes inside another pair of single quotes that way: you'd have to escape them somehow. But there's no need for them at all here. Version 2:
awk '{if ($2 == 0) { total += $3; count++ } END { print total/count } }' /tmp/CLN_Tapes_LON
This is close, but not quite right: the END block is only executed once all lines of input have been processed, so it doesn't make sense to have it inside an if. Let's move it outside the braces. I'm also going to tighten up some whitespace. Version 3:
awk '{if ($2==0) {total+=$3; count++}} END{print total/count}' /tmp/CLN_Tapes_LON
Version 3 actually works, and you could stop here. But awk has a handy way of running a block of code only against lines that match a condition: 'condition {code}'. So yours can be written more simply as:
awk '$2==0 {total+=$3; count++} END{print total/count}' /tmp/CLN_Tapes_LON
... which, of course, is pretty much exactly what John1024 suggested.
$ awk '$2 == 0 { total += $3; count++;} END { print total/count; }' CLN_Tapes_LON
3
This assumes that your input file looks like:
$ cat CLN_Tapes_LON
CLH040 0 3
CLH041 0 3
CLH042 0 3
CLH043 0 3
CLH010 1 0
CLH011 1 0
CLH012 1 0
CLH013 1 0
CLH130 1 40
CLH131 1 40
CLH132 1 40
CLH133 1 40
Thought I'd try to do this without awk. Awk is clearly the better choice, but it's still a one-liner.
bc<<<"($(grep ' 0 ' file|tee >(wc -l>i)|cut -d\ -f3|tr '\n' '+')0)/"$(<i)
3
It extracts lines with 0 in the second column with grep. This is passed to tee for wc -l to count the lines and to cut to extract the third column. tr replaces the new lines with "+" which is put over the number of lines (i.e., "12 / 4"). This is then passed to bc.
If we have an input:
TargetIDs,CPD,Value,SMILES
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N
Now we would like to separate the duplicates and non-duplicates based on the fourth column (smiles)
duplicate:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate:
95,CPD-3333333,-1,c1ccccc1N
The following attempt can separate the duplicates without any problem. However, the first occurrence of each duplicate is still included in the non-duplicate file.
BEGIN { FS = ","; f1 = "a"; f2 = "b" }
{
    # Keep count of the values in the fourth column
    count[$4]++;
    # Save the line the first time we encounter a unique field
    if (count[$4] == 1)
        first[$4] = $0;
    # If we encounter the field for the second time, print the
    # previously saved line
    if (count[$4] == 2)
        print first[$4] > f1;
    # From the second time onward, always print because the field is
    # duplicated
    if (count[$4] > 1)
        print > f1;
    if (count[$4] == 1)   # if (count[$4] - count[$4] == 0) <= changing to this doesn't work
        print first[$4] > f2;
}
Duplicate output from the attempt:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
Non-duplicate output from the attempt:
TargetIDs,CPD,Value,SMILES
95,CPD-3333333,-1,c1ccccc1N
95,CPD-1111111,-2,c1ccccc1
May I know if any guru might have comments/solutions? Thanks.
I would do this:
awk '
NR==FNR {count[$2] = $1; next}
FNR==1 {FS=","; next}
{
output = (count[$NF] == 1 ? "nondup" : "dup")
print > output
}
' <(cut -d, -f4 input | sort | uniq -c) input
The process substitution will pre-process the file and perform a count on the 4th column. Then, you can process the file and decide if that line is "duplicated".
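A self-contained run of this approach against the sample (requires bash for the process substitution; the input file is recreated inline):

```shell
cat > input <<'EOF'
TargetIDs,CPD,Value,SMILES
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N
EOF

# First pass (the process substitution): "count value" pairs for column 4.
# Second pass: route each data line to "dup" or "nondup" by its count.
awk '
NR==FNR {count[$2] = $1; next}
FNR==1 {FS=","; next}
{
    output = (count[$NF] == 1 ? "nondup" : "dup")
    print > output
}
' <(cut -d, -f4 input | sort | uniq -c) input

cat dup nondup
```

Note the FS="," assignment on the header line only takes effect from the next record, which is exactly what's wanted here: the first file (the uniq -c output) is split on whitespace, the CSV body on commas.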
All in awk: Ed Morton shows a way to collect the data in a single pass. Here's a two-pass solution that's virtually identical to my example above:
awk -F, '
NR==FNR {count[$NF]++; next}
FNR==1 {next}
{
output = (count[$NF] == 1 ? "nondup" : "dup")
print > output
}
' input input
Yes, the input file is given twice.
$ cat tst.awk
BEGIN{ FS="," }
NR>1 {
if (cnt[$4]++) {
dups[$4] = nonDups[$4] dups[$4] $0 ORS
delete nonDups[$4]
}
else {
nonDups[$4] = $0 ORS
}
}
END {
print "Duplicates:"
for (key in dups) {
printf "%s", dups[key]
}
print "\nNon Duplicates:"
for (key in nonDups) {
printf "%s", nonDups[key]
}
}
$ awk -f tst.awk file
Duplicates:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
Non Duplicates:
95,CPD-3333333,-1,c1ccccc1N
This solution only works if the duplicates are grouped together.
awk -F, '
function fout( f, i) {
f = (cnt > 1) ? "dups" : "nondups"
for (i = 1; i <= cnt; ++i)
print lines[i] > f
}
NR > 1 && $4 != lastkey { fout(); cnt = 0 }
{ lastkey = $4; lines[++cnt] = $0 }
END { fout() }
' file
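If the input is not already grouped, a sort on the 4th column can make it so first (a sketch; the file names input.csv and grouped.csv and the sort preprocessing step are assumptions, not part of the answer above):

```shell
cat > input.csv <<'EOF'
TargetIDs,CPD,Value,SMILES
95,CPD-3333333,-1,c1ccccc1N
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
EOF

# Keep the header aside and group the data rows by the 4th column
{ head -n 1 input.csv; tail -n +2 input.csv | sort -t, -k4,4; } > grouped.csv

# Same program as above, run on the grouped file
awk -F, '
function fout(    f, i) {
    f = (cnt > 1) ? "dups" : "nondups"
    for (i = 1; i <= cnt; ++i)
        print lines[i] > f
}
NR > 1 && $4 != lastkey { fout(); cnt = 0 }
{ lastkey = $4; lines[++cnt] = $0 }
END { fout() }
' grouped.csv
```

After this, the "dups" file holds the two c1ccccc1 rows and "nondups" holds the header plus the c1ccccc1N row.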
A little late, but here's my version in awk:
awk -F, 'NR>1{a[$0":"$4];b[$4]++}
END{d="\n\nnondupe";e="dupe"
for(i in a){split(i,c,":");b[c[2]]==1?d=d"\n"i:e=e"\n"i} print e d}' file
Another one, built similarly to glenn jackman's, but all in awk:
awk -F, 'function r(f) {while((getline <f)>0)a[$4]++;close(f)}
BEGIN{r(ARGV[1])}{output=(a[$4] == 1 ? "nondup" : "dup");print >output} ' file
I need to print only one of several consecutive lines that share the same first field. The last field is a set of words, and I need the line whose last field contains the most elements. If several lines tie for the maximum number of elements, any one of them is fine.
Example input:
("aborrecimento",[Noun],[Masc],[Reg:Sing],[Bulk])
("aborrecimento",[Noun],[Masc],[Reg:Sing],[Device,Concrete,Count])
("aborrecimento",[Noun],[Masc],[Reg:Sing],[])
("adiamento",[Noun],[Masc],[Reg:Sing],[])
("adiamento",[Noun],[Masc],[Reg:Sing],[Count])
("adiamento",[Noun],[Masc],[Reg:Sing],[VerbNom])
Example output:
("aborrecimento",[Noun],[Masc],[Reg:Sing],[Device,Concrete,Count])
("adiamento",[Noun],[Masc],[Reg:Sing],[VerbNom])
A solution with awk would be nice, but it doesn't need to be a one-liner.
generate index file
$ cat input.txt |
sed 's/,\[/|[/g' |
awk -F'|' '
{if(!gensub(/[[\])]/, "", "g", $NF))n=0;else n=split($NF, a, /,/); print NR,$1,n}
' |
sort -k2,2 -k3,3nr |
awk '$2!=x{x=$2;print $1}' >idx.txt
content of index file
$ cat idx.txt
2
5
select lines
$ awk 'NR==FNR{idx[$0]; next}; (FNR in idx)' idx.txt input.txt
("aborrecimento",[Noun],[Masc],[Reg:Sing],[Device,Concrete,Count])
("adiamento",[Noun],[Masc],[Reg:Sing],[Count])
Note: there are no spaces in input.txt
Use [ as the field delimiter, then split the last field on ,:
awk -F '[[]' '
{split($NF, f, /,/)}
length(f) > max[$1] {line[$1] = $0; max[$1] = length(f)}
END {for (l in line) print line[l]}
' filename
Since order is important, an update:
awk -F '[[]' '
{split($NF, f, /,/)}
length(f) > max[$1] {line[$1] = $0; max[$1] = length(f); nr[$1] = NR}
END {for (l in line) printf("%d\t%s\n", nr[l], line[l])}
' filename |
sort -n |
cut -f 2-
Something like this might work:
awk 'BEGIN {FS="["}
Ff != gensub("^([^,]+).*","\\1","g",$0) { Ff = gensub("^([^,]+).*","\\1","g",$0) ; Lf = $NF ; if (length(Ml) > 0) { print Ml } }
Ff == gensub("^([^,]+).*","\\1","g",$0) { if (length($NF) > length(Lf)) { Lf=$NF ; Ml=$0 } }
END {if (length(Ml) > 0) { print Ml } }' INPUTFILE
See here in action. BUT this is not the solution you want to use, as it is rather a hack. And it fails if by "longer" you meant more comma-separated elements rather than more characters in the last field. (E.g., the above script happily reports [KABLAMMMMMMMMMMM!] as longer than [A,B,C].)
This might work for you:
sort -r file | sort -t, -k1,1 -u