add suffix to duplicate fields in one column - awk

example input file contains rows with tab delimited fields:
raw1 aaa wer
raw2 bbb dfg
raw3 ccc fgh
raw4 ccc etr
raw5 aaa cbg
raw6 aaa dfg
I need to add the suffix (dupl) to the second column if its value also occurs in the second column of any other line.
Expected output for above input is:
raw1 aaa(dupl) wer
raw2 bbb dfg
raw3 ccc(dupl) fgh
raw4 ccc(dupl) etr
raw5 aaa(dupl) cbg
raw6 aaa(dupl) dfg
For now I have the following code, but it is inefficient on huge input files:
cut -d$'\t' -f2 input|\
sort|\
uniq -c|\
awk '$1>1{print $2}'|\
while read dup;do gawk -F$'\t' -i inplace -va=$dup '$2==a{$2=a"(dupl)"}1' OFS=$'\t' input;done
What would be the easier way to achieve the same with single awk or sed command, and without multiple pipes as above?

I would use GNU AWK for this task in the following way. Let the content of file.txt be
raw1 aaa wer
raw2 bbb dfg
raw3 ccc fgh
raw4 ccc etr
raw5 aaa cbg
raw6 aaa dfg
then
awk 'BEGIN{FS=OFS="\t"}FNR==NR{arr[$2]+=1;next}arr[$2]>1{$2=$2 "(dupl)"}{print}' file.txt file.txt
gives output
raw1 aaa(dupl) wer
raw2 bbb dfg
raw3 ccc(dupl) fgh
raw4 ccc(dupl) etr
raw5 aaa(dupl) cbg
raw6 aaa(dupl) dfg
Explanation: I tell GNU AWK that the tab character (\t) is both the field separator (FS) and the output field separator (OFS), and I pass file.txt twice. The first pass (FNR==NR) builds the array arr, whose keys are the 2nd fields and whose values are the number of times each one appears. In the second pass, where a value occurred more than once, I append (dupl) to the 2nd field. Every line is then printed.
(tested in GNU Awk 5.0.1)
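If the input arrives on a pipe and cannot be read twice, a single-pass variant is possible at the cost of holding every line in memory. This is only a sketch, not the method of the answer above: the array names count, line and key are made up for illustration, and the sample rows from the question are fed in with printf.

```shell
# Sample rows from the question, tab-delimited, fed through a pipe.
result=$(printf 'raw1\taaa\twer\nraw2\tbbb\tdfg\nraw3\tccc\tfgh\nraw4\tccc\tetr\nraw5\taaa\tcbg\nraw6\taaa\tdfg\n' |
awk '
BEGIN { FS = OFS = "\t" }
{
    count[$2]++        # how often each 2nd-field value occurs
    line[NR] = $0      # buffer the line for the END block
    key[NR]  = $2
}
END {
    for (i = 1; i <= NR; i++) {
        $0 = line[i]                              # re-split the buffered line
        if (count[key[i]] > 1) $2 = $2 "(dupl)"   # mark values seen more than once
        print
    }
}')
printf '%s\n' "$result"
```

The trade-off is memory: the two-pass version keeps only the counts, while this one keeps the whole input until the END block runs.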

Related

How to print lines before and after a match until a specific match (3 matching patterns)

I have lengthy data that is built in blocks.
In the example below, blocks start with (AAA) and end with (FFF), and between them they can have many lines of information.
I want to extract specific blocks, only if the pattern (CCC) is inside these blocks.
An example would be:
cat text
AAA1
BBB
FFF1
AAA2
BBB
CCC2
DDD
EEE
FFF2
AAA3
BBB
FFF3
AAA4
BBB
CCC4
DDD
EEE
FFF4
The output should be:
AAA2
BBB
CCC2
DDD
EEE
FFF2
AAA4
BBB
CCC4
DDD
EEE
FFF4
I thought on using sed, but not really working:
If I use this, it only gives me the lines from CCC to the next AAA/FFF: sed -n -e '/CCC/,/AAA/ p' text
CCC2
DDD
EEE
FFF2
AAA3
CCC4
DDD
EEE
FFF4
If I use it this way: sed -n -e '/AAA/,/FFF/ p' text I also capture the blocks between AAA and FFF that do not have CCC in them.
This might work for you (GNU sed):
sed -n '/AAA/{:a;N;/FFF/!ba;/CCC/p}' file
Turn off implicit printing -n because this is a filtering operation.
Match a line containing AAA and append further lines until one containing FFF.
If the collection contains the string CCC, print it.
Repeat.
N.B. This assumes AAA and FFF are paired, if not use:
sed -n '/AAA/{:a;N;/\n.*AAA/s/.*\n//;/FFF/!ba;/CCC/p}' file
Alternative:
sed -n 'H;/AAA/h;/FFF/{g;/AAA.*CCC/p;z;h}' file
EDIT:
For AAA, CCC and FFF at the beginning of a line, use:
sed -n '/^AAA/{:a;N;/^FFF/M!ba;/^CCC/Mp}' file
or
sed -n '/^AAA/{:a;N;/\nAAA/s/.*\n//;/\nFFF/!ba;/\nCCC/p}' file
or
sed -n 'H;/^AAA/h;/^FFF/{g;/AAA.*\nCCC/p;z;h}' file
Using any awk in any shell on every Unix box:
$ awk '/^AAA/{a=1; buf=""} /^CCC/{c=1} {buf=buf $0 ORS} /^FFF/{if (a && c) printf "%s", buf; a=c=0}' text
AAA2
BBB
CCC2
DDD
EEE
FFF2
AAA4
BBB
CCC4
DDD
EEE
FFF4
You can consider your input as data-blocks, with AAA.* as the start-tag and FFF.* as your end-tag. Now collect each block into hold-space and at the end-tag, check if the block contains the desired pattern.
For example, here is a GNU sed version that does this:
parse.sed
# Start-tag -> start a new block in hold-space
/^AAA/ { h; b; }
# Save input
H
# End-tag AND block contains CCC -> print
/^FFF/ { x; /\nCCC/ p; }
Run it like this, e.g.:
sed -nf parse.sed | sed '/^FFF/G'
Or as a one-liner:
sed -n '/^AAA/{h;b};H;/^FFF/{x;/\nCCC/p}' | sed '/^FFF/G'
Output:
AAA2
BBB
CCC2
DDD
EEE
FFF2
AAA4
BBB
CCC4
DDD
EEE
FFF4
A more portable sed script looks like this:
# Start-tag -> start a new block in hold-space
/^AAA/ {
h
b
}
# Save input
H
# End-tag AND block contains CCC -> print
/^FFF/ {
x
/\nCCC/p
}
An awk variant
awk '/^AAA/{f=1} f{i=i $0 ORS} /^FFF/{if(i~/\nCCC/){printf "%s", i} i=f=""}' input
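The collect-then-test idea used in the answers above can also be written with the three patterns passed in as awk variables, so the same script can be reused for other markers. This is a sketch under assumptions: the variable names start, want, stop, buf, inblock and hit are hypothetical, and the sample data from the question is piped in with printf instead of being read from a file.

```shell
# Sample data from the question, one item per line.
result=$(printf '%s\n' AAA1 BBB FFF1 AAA2 BBB CCC2 DDD EEE FFF2 \
                       AAA3 BBB FFF3 AAA4 BBB CCC4 DDD EEE FFF4 |
awk -v start='^AAA' -v want='^CCC' -v stop='^FFF' '
$0 ~ start { buf = ""; inblock = 1; hit = 0 }             # new block: reset state
inblock    { buf = buf $0 ORS; if ($0 ~ want) hit = 1 }   # collect lines, note a match
$0 ~ stop  { if (inblock && hit) printf "%s", buf; inblock = 0 }
')
printf '%s\n' "$result"
```

A stray start line before the previous block has ended simply restarts the buffer, so unpaired AAA lines are dropped rather than printed.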

Keep lines only if a column is repeated three times within the file [closed]

I have a file of columns and I want to keep only the lines whose second-column value occurs exactly three times within the file.
Input:
000 BBB PPP DDD
111 BBB SSS 444
777 CCC RRR 555
222 BBB 555 666
321 AAA YYY MMM
123 CCC LLL MMM
OOO AAA BBB VVV
545 UUU 321 R32
PPP AAA HHH TTT
Desired output
000 BBB PPP DDD
111 BBB SSS 444
222 BBB 555 666
321 AAA YYY MMM
OOO AAA BBB VVV
PPP AAA HHH TTT
I have searched the internet but found nothing similar. Any help is welcome. Thanks.
1st solution: Could you please try the following, written and tested with the shown samples in GNU awk.
awk '
{
  cntIndArray[$2]++
  valArray[$2]=(valArray[$2]?valArray[$2] ORS:"")$0
}
END{
  for(i in cntIndArray){
    if(cntIndArray[i]==3){
      print valArray[i]
    }
  }
}' Input_file
Explanation: Adding a detailed explanation of the above.
awk '                         ##Starting awk program from here.
{
  cntIndArray[$2]++           ##Creating an array which keeps track of how often each 2nd field occurs.
  valArray[$2]=(valArray[$2]?valArray[$2] ORS:"")$0   ##Creating an array which collects the lines sharing a 2nd field, joined with newlines.
}
END{                          ##Starting END block of this code here.
  for(i in cntIndArray){      ##Traversing the array which holds the field counts.
    if(cntIndArray[i]==3){    ##Checking if the count equals 3, then do the following.
      print valArray[i]       ##Printing the collected lines stored under index i.
    }
  }
}' Input_file                 ##Mentioning Input_file name here.
2nd solution: In case you need the output in the same sequence in which the 2nd field first occurs in Input_file, then try the following.
awk '
!seen[$2]++{
  cntIndArray[++count]=$2
}
{
  cntArray[$2]++
  valArray[$2]=(valArray[$2]?valArray[$2] ORS:"")$0
}
END{
  for(i=1;i<=count;i++){
    if(cntArray[cntIndArray[i]]==3){
      print valArray[cntIndArray[i]]
    }
  }
}' Input_file
$ awk 'NR==FNR{a[$2]++;next} a[$2]==3' file{,}
000 BBB PPP DDD
111 BBB SSS 444
222 BBB 555 666
321 AAA YYY MMM
OOO AAA BBB VVV
PPP AAA HHH TTT
Note that this is a two-pass approach, so the input must be a file that can be read twice; it will not work if the data is piped in. For small files the second read costs little. For very large files it still works, whereas approaches that keep all the data in memory might not.
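For piped input, where a second pass is not possible, one option is to buffer the lines and filter them in the END block. The following is a minimal sketch, assuming the whole input fits in memory; the array names count, line and key are hypothetical, and the question's sample rows are supplied through printf.

```shell
# Sample rows from the question, piped in rather than read from a file.
result=$(printf '%s\n' \
  '000 BBB PPP DDD' '111 BBB SSS 444' '777 CCC RRR 555' \
  '222 BBB 555 666' '321 AAA YYY MMM' '123 CCC LLL MMM' \
  'OOO AAA BBB VVV' '545 UUU 321 R32' 'PPP AAA HHH TTT' |
awk '
{ count[$2]++; line[NR] = $0; key[NR] = $2 }     # buffer every line with its key
END {
    for (i = 1; i <= NR; i++)
        if (count[key[i]] == 3) print line[i]    # keep keys seen exactly 3 times
}')
printf '%s\n' "$result"
```

Unlike a for-in loop over the array, walking the indices 1..NR prints the surviving lines in their original input order.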

AWK or sed way to paste non-adjacent lines

$ cat file
aaa bbb ccc
ddd eee
jjj kkk lll
mmm
nnn ooo ppp
The following AWK command will paste the 'mmm' line at the end of the 'ddd eee' line. Is there a simpler way to do this using AWK or sed?
$ awk 'FNR==NR {if (NR==4) foo=$0; next} FNR==2 {print $0" "foo; next} FNR==4 {next} 1' file file
aaa bbb ccc
ddd eee mmm
jjj kkk lll
nnn ooo ppp
To clarify: I want to paste line 4 at the end of line 2 in this particular file, with a single space between the 'ddd eee' and the 'mmm'. That's the task. Is there an AWK or sed solution that's simpler than the one I came up with?
This can be done in sed using the hold space:
sed '2{N;h;N;s/\n.*\n/ /;p;g;D;}' file
2{...} Run the enclosed commands on line two.
N;h;N Read next two lines into the pattern space, holding the first two.
s/\n.*\n/ / Substitute a space for the middle line.
p;g;D Print the pasted lines, load the hold space, and delete the
first line (leaving the one that was removed by the previous substitute).
or using captures (\(...\)) & back-references (\1, \2, etc.):
sed '2{N;N;s/\(\n.*\)\n\(.*\)/ \2\1/;}' file
2{...} Run the enclosed commands on line two.
N;N Read next two lines into the pattern space.
s/\(\n.*\)\n\(.*\)/ \2\1/ Swap the third and fourth lines of the file, joining the fourth line onto the end of the second.
\(\n.*\) Capture the third line, including the leading newline.
\n\(.*\) Capture the fourth line, excluding the leading newline.
/ \2\1/ Replace the matched portion (the third & fourth lines) with a space, followed by the second, and then the first capture groups.
This meets the letter of the amended problem statement — it prints line 1, appends line 4 after the content of line 2 as line 2, then prints line 3, and then prints line 5 and beyond:
awk 'NR == 1 || NR >= 5 { print; next }
NR == 2 { save2 = $0 }
NR == 3 { save3 = $0 }
NR == 4 { print save2, $0; print save3 }' file
It's simpler than the code in the question in that it only scans the file once.
The output:
aaa bbb ccc
ddd eee mmm
jjj kkk lll
nnn ooo ppp
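The same single-scan idea can be generalized so that the line numbers are not hard-coded. This is only a sketch: the variable names dst, src and held are hypothetical, it assumes dst is less than src, and the sample file content is piped in with printf.

```shell
# Paste line src onto the end of line dst in one pass (assumes dst < src).
result=$(printf '%s\n' 'aaa bbb ccc' 'ddd eee' 'jjj kkk lll' 'mmm' 'nnn ooo ppp' |
awk -v dst=2 -v src=4 '
NR < dst || NR > src { print; next }     # outside the range: pass through
NR < src { held[NR] = $0; next }         # buffer lines dst .. src-1
{                                        # NR == src: emit the joined line first
    print held[dst], $0
    for (i = dst + 1; i < src; i++) print held[i]
}
END {                                    # input ended before line src: flush the buffer
    if (NR < src) for (i = dst; i <= NR; i++) print held[i]
}')
printf '%s\n' "$result"
```

Only the lines between dst and src are held in memory, so the cost depends on the distance between the two lines rather than on the file size.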
Solution in TXR:
$ txr -c '@line1
@line2
@line3
@line4
@(data rest)
@(output)
@line1
@line2 @line4
@line3
@ (repeat)
@ rest
@ (end)
@(end)' file
aaa bbb ccc
ddd eee mmm
jjj kkk lll
nnn ooo ppp
This is simpler:
$ awk 'FNR==NR {if (NR==4) foo=$0; next} FNR==2{$0=$0" "foo} FNR!=4' file file
aaa bbb ccc
ddd eee mmm
jjj kkk lll
nnn ooo ppp
Other solutions might be faster or use less memory but they won't be simpler.

How to compare two files, if same column then replace, using awk or sed

Two files:
f1:
1 aaa 123
2 bbb 555
3 ccc 666
f2:
1 aaa 444
2 ddd 666
3 eee 777
When a row of f2 has the same second column as a row of f1 (here aaa), the third column of that f2 row should be replaced by the third column from f1, using awk or sed.
Output:
1 aaa 123
2 ddd 666
3 eee 777
Thanks
Try this:
awk 'FNR==NR { a[$2]=$3; next } { print $1, $2, (($2 in a) ? a[$2] : $3) }' f1 f2
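When looking up replacement values in an awk array, a membership test ($2 in a) behaves differently from a truthiness test on a[$2] if a stored value can be 0 or empty, because such a value counts as false. The sketch below illustrates this with made-up data: the temporary files and the replacement value 0 are assumptions for the demonstration, not part of the question.

```shell
tmp=$(mktemp -d)
printf '1 aaa 0\n2 bbb 555\n'   > "$tmp/f1"   # hypothetical f1: replacement value is 0
printf '1 aaa 444\n2 ddd 666\n' > "$tmp/f2"

# "$2 in a" tests whether the key exists, so a stored 0 is still used.
result=$(awk 'FNR==NR { a[$2]=$3; next }
              { print $1, $2, (($2 in a) ? a[$2] : $3) }' "$tmp/f1" "$tmp/f2")
printf '%s\n' "$result"
rm -rf "$tmp"
```

The in operator also avoids creating empty array entries for keys that are only looked up, which a plain a[$2] reference would do.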

Split Multiple Line values into Column using sed/awk between Pattern markers

I need to join the rows between pattern markers 1 and 2 into tab-separated columns.
What I notice is that my sed attempt fails to split the columns. Could someone help?
StartPattern1
AAA\n
BBB\n
CCC\n
EndPattern
Some text
StartPattern2
XXX\n
YYY\n
ZZZ\n
MMM\n
NNN\n
EndPattern2
Result Needed from sed/awk:
StartPattern1
AAA\tBBB\tCCC
End Pattern1
StartPattern2
XXX\tYYY\tZZZ\tMMM\tNNN\n
EndPattern2
This should work:
cat file
StartPattern1
AAA
BBB
CCC
EndPattern
Some text
StartPattern2
XXX
YYY
ZZZ
MMM
NNN
EndPattern2
awk '/StartPattern/ {f=1;print;next} f && ! /EndPattern/ {printf "%s%s",$0,(f?"\t":RS)} /EndPattern/ {f=0;print "\n"$0;next}' file
StartPattern1
AAA BBB CCC
EndPattern
StartPattern2
XXX YYY ZZZ MMM NNN
EndPattern2
Here is another one:
awk '/^Start/{f=g=1} /^End/{f=0;print "\n"$0} f {printf "%s%s",$0,(g--==1?RS:"\t")}' file
StartPattern1
AAA BBB CCC
EndPattern
StartPattern2
XXX YYY ZZZ MMM NNN
EndPattern2
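If a trailing tab at the end of each joined row is unwanted, one option is to print the separator before every item except the first. This is a sketch with hypothetical variable names (inblock, sep); like the answers above, it drops lines that fall outside a Start/End pair, and the sample input is piped in with printf.

```shell
result=$(printf '%s\n' StartPattern1 AAA BBB CCC EndPattern 'Some text' \
                       StartPattern2 XXX YYY ZZZ MMM NNN EndPattern2 |
awk '
/^StartPattern/ { print; inblock = 1; sep = ""; next }               # marker line, then start joining
/^EndPattern/   { if (inblock) print ""; inblock = 0; print; next }  # end the joined row, print marker
inblock         { printf "%s%s", sep, $0; sep = "\t" }               # tab only between items
')
printf '%s\n' "$result"
```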