Keep lines only if a column is repeated three times within the file

I have a column-based file and I want to keep only the lines whose second-column value occurs exactly three times within the file.
Input:
000 BBB PPP DDD
111 BBB SSS 444
777 CCC RRR 555
222 BBB 555 666
321 AAA YYY MMM
123 CCC LLL MMM
OOO AAA BBB VVV
545 UUU 321 R32
PPP AAA HHH TTT
Desired output
000 BBB PPP DDD
111 BBB SSS 444
222 BBB 555 666
321 AAA YYY MMM
OOO AAA BBB VVV
PPP AAA HHH TTT
I have searched the internet but found nothing similar. Any help is welcome. Thanks.

1st solution: Could you please try the following, written and tested with the shown samples in GNU awk.
awk '
{
cntIndArray[$2]++
valArray[$2]=(valArray[$2]?valArray[$2] ORS:"")$0
}
END{
for(i in cntIndArray){
if(cntIndArray[i]==3){
print valArray[i]
}
}
}' Input_file
Explanation: a detailed, commented version of the above.
awk ' ##Starting awk program from here.
{
cntIndArray[$2]++ ##Creating an array that keeps track of how often each 2nd field occurs.
valArray[$2]=(valArray[$2]?valArray[$2] ORS:"")$0 ##Creating an array that collects the lines sharing a 2nd field, joined with newlines.
}
END{ ##Starting END block of this code here.
for(i in cntIndArray){ ##Traversing through array which has field count here.
if(cntIndArray[i]==3){ ##Checking if an element value equals 3 then do following.
print valArray[i] ##Printing array value with index i which has exact line value in it.
}
}
}' Input_file ##Mentioning Input_file name here.
2nd solution: If you need the output in the same order in which the 2nd field values first occur in Input_file, try the following.
awk '
!seen[$2]++{
cntIndArray[++count]=$2
}
{
cntArray[$2]++;
valArray[$2]=(valArray[$2]?valArray[$2] ORS:"")$0
}
END{
for(i=1;i<=count;i++){
if(cntArray[cntIndArray[i]]==3){
print valArray[cntIndArray[i]]
}
}
}' Input_file

$ awk 'NR==FNR{a[$2]++;next} a[$2]==3' file{,}
000 BBB PPP DDD
111 BBB SSS 444
222 BBB 555 666
321 AAA YYY MMM
OOO AAA BBB VVV
PPP AAA HHH TTT
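Note that this is a two-pass approach: the file is read twice, first to count each second-column value and then to print the matching lines (file{,} is brace expansion for "file file" in shells that support it, such as bash). Because the input has to be opened twice, this will not work if the data is piped in. For small files the choice of approach doesn't matter; for very large files this still works, whereas keeping all the data in memory might not.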
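If the data does arrive on a pipe, one option is to buffer the lines in memory and decide at the end. This is only a sketch: the function name keep_triples is made up for illustration, and it assumes the whole input fits in memory. Unlike the grouped solutions above, it prints the surviving lines in their original input order.

```shell
# keep_triples: hypothetical helper name, not from the answers above.
# Reads stdin once, buffers every line, then prints (in input order)
# the lines whose 2nd field occurs exactly three times.
keep_triples() {
    awk '
    { line[NR] = $0; key[NR] = $2; count[$2]++ }   # remember the line, its key, and the key count
    END {
        for (i = 1; i <= NR; i++)
            if (count[key[i]] == 3)
                print line[i]
    }'
}

# Usage: works on a pipe, because the input is never read a second time.
printf '%s\n' '000 BBB PPP DDD' '111 BBB SSS 444' '777 CCC RRR 555' \
    '222 BBB 555 666' '321 AAA YYY MMM' '123 CCC LLL MMM' \
    'OOO AAA BBB VVV' '545 UUU 321 R32' 'PPP AAA HHH TTT' | keep_triples
```

With the sample input from the question this should print the three BBB lines followed by the three AAA lines.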

Related

AWK or sed way to paste non-adjacent lines

$ cat file
aaa bbb ccc
ddd eee
jjj kkk lll
mmm
nnn ooo ppp
The following AWK command will paste the 'mmm' line at the end of the 'ddd eee' line. Is there a simpler way to do this using AWK or sed?
$ awk 'FNR==NR {if (NR==4) foo=$0; next} FNR==2 {print $0" "foo; next} FNR==4 {next} 1' file file
aaa bbb ccc
ddd eee mmm
jjj kkk lll
nnn ooo ppp
To clarify: I want to paste line 4 at the end of line 2 in this particular file, with a single space between the 'ddd eee' and the 'mmm'. That's the task. Is there an AWK or sed solution that's simpler than the one I came up with?
This can be done in sed using the hold space:
sed '2{N;h;N;s/\n.*\n/ /;p;g;D;}' file
2{...} Run the enclosed commands on line two.
N;h;N Read next two lines into the pattern space, holding the first two.
s/\n.*\n/ / Substitute a space for the middle line.
p;g;D Print the pasted lines, load the hold space, and delete the
first line (leaving the one that was removed by the previous substitute).
or using captures (\(...\)) & back-references (\1, \2, etc.):
sed '2{N;N;s/\(\n.*\)\n\(.*\)/ \2\1/;}' file
2{...} Run the enclosed commands on line two.
N;N Read next two lines into the pattern space.
s/\(\n.*\)\n\(.*\)/ \2\1/ Swap the third and fourth lines of the file, joining the fourth line onto the end of the second.
\(\n.*\) Capture the third line, including the leading newline.
\n\(.*\) Capture the fourth line, excluding the leading newline.
/ \2\1/ Replace the matched portion (the third & fourth lines) with a space, followed by the second, and then the first capture groups.
This meets the letter of the amended problem statement — it prints line 1, appends line 4 after the content of line 2 as line 2, then prints line 3, and then prints line 5 and beyond:
awk 'NR == 1 || NR >= 5 { print; next }
NR == 2 { save2 = $0 }
NR == 3 { save3 = $0 }
NR == 4 { print save2, $0; print save3 }' file
It's simpler than the code in the question in that it only scans the file once.
The output:
aaa bbb ccc
ddd eee mmm
jjj kkk lll
nnn ooo ppp
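The same single-pass idea can be generalized to arbitrary line numbers. The following is a rough sketch, not part of the answer above: paste_line, dst and src are names invented here, and it assumes dst is smaller than src.

```shell
# paste_line: hypothetical helper, not taken from the answers above.
# Appends line number SRC to the end of line number DST (DST < SRC)
# and drops the original SRC line, reading the input only once.
paste_line() {   # usage: paste_line DST SRC < input
    awk -v dst="$1" -v src="$2" '
    NR < dst || NR > src { print; next }    # outside the window: print as is
    NR < src { held[++n] = $0; next }       # hold lines DST .. SRC-1
    {                                       # NR == src: emit the joined line, then the rest
        print held[1], $0
        for (i = 2; i <= n; i++) print held[i]
    }
    END {                                   # input ended before SRC: flush what was held
        if (NR < src) for (i = 1; i <= n; i++) print held[i]
    }'
}

printf '%s\n' 'aaa bbb ccc' 'ddd eee' 'jjj kkk lll' 'mmm' 'nnn ooo ppp' | paste_line 2 4
```

Only the lines between the two positions are held in memory, so this stays cheap even on large files as long as the two line numbers are close together.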
Solution in TXR:
$ txr -c '@line1
@line2
@line3
@line4
@(data rest)
@(output)
@line1
@line2 @line4
@line3
@ (repeat)
@ rest
@ (end)
@(end)' file
aaa bbb ccc
ddd eee mmm
jjj kkk lll
nnn ooo ppp
This is simpler:
$ awk 'FNR==NR {if (NR==4) foo=$0; next} FNR==2{$0=$0" "foo} FNR!=4' file file
aaa bbb ccc
ddd eee mmm
jjj kkk lll
nnn ooo ppp
Other solutions might be faster or use less memory but they won't be simpler.

How to compare two files, if same column then replace, using awk or sed

Two files:
f1:
1 aaa 123
2 bbb 555
3 ccc 666
f2:
1 aaa 444
2 ddd 666
3 eee 777
When column 2 of a row in f2 matches column 2 of a row in f1 (like aaa here), column 3 of that f2 row should be replaced by column 3 of the matching f1 row, using awk or sed.
Output:
1 aaa 123
2 ddd 666
3 eee 777
Thanks
Try this:
awk 'FNR==NR { a[$2]=$3; next } { print $1, $2, (($2 in a) ? a[$2] : $3) }' f1 f2
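One detail worth knowing with this kind of lookup: testing the stored value for truth treats a stored 0 or an empty string as "not found", while a membership test with "in" does not. A minimal sketch of the difference, with a made-up wrapper name (merge_col3) and made-up sample data where f1 holds a 0:

```shell
# merge_col3: hypothetical wrapper around the two-file lookup.
merge_col3() {   # usage: merge_col3 f1 f2
    awk '
    FNR == NR { a[$2] = $3; next }                 # first file: remember column 3 by column 2
    { print $1, $2, (($2 in a) ? a[$2] : $3) }     # second file: substitute when the key exists
    ' "$1" "$2"
}

# Illustrative data only: the value stored for aaa is 0.
dir=$(mktemp -d)
printf '%s\n' '1 aaa 0' '2 bbb 555' > "$dir/f1"
printf '%s\n' '1 aaa 444' '2 ddd 666' > "$dir/f2"
merge_col3 "$dir/f1" "$dir/f2"
rm -r "$dir"
```

Here the aaa row should come out with 0 in column 3; a plain truthiness test on a[$2] would have kept 444 instead.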

Split Multiple Line values into Column using sed/awk between Pattern markers

I need to turn the rows between pattern markers 1 and 2 into tab-separated columns.
My sed attempt fails to split the rows into columns. Could someone help?
StartPattern1
AAA\n
BBB\n
CCC\n
EndPattern
Some text
StartPattern2
XXX\n
YYY\n
ZZZ\n
MMM\n
NNN\n
EndPattern2
Result Needed from sed/awk:
StartPattern1
AAA\tBBB\tCCC
End Pattern1
StartPattern2
XXX\tYYY\tZZZ\tMMM\tNNN\n
EndPattern2
This should work:
cat file
StartPattern1
AAA
BBB
CCC
EndPattern
Some text
StartPattern2
XXX
YYY
ZZZ
MMM
NNN
EndPattern2
awk '/StartPattern/ {f=1;print;next} f && ! /EndPattern/ {printf "%s%s",$0,(f?"\t":RS)} /EndPattern/ {f=0;print "\n"$0;next}' file
StartPattern1
AAA BBB CCC
EndPattern
StartPattern2
XXX YYY ZZZ MMM NNN
EndPattern2
Here is another one:
awk '/^Start/{f=g=1} /^End/{f=0;print "\n"$0} f {printf "%s%s",$0,(g--==1?RS:"\t")}' file
StartPattern1
AAA BBB CCC
EndPattern
StartPattern2
XXX YYY ZZZ MMM NNN
EndPattern2
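Both one-liners above leave a tab after the last item of each block and drop lines outside the markers. If that matters, a variant along these lines might do. It is only a sketch: join_blocks is an invented name, and it assumes the markers always start with StartPattern and EndPattern at the beginning of the line.

```shell
# join_blocks: hypothetical helper. Joins the lines between a StartPattern
# marker and the next EndPattern marker with tabs, without a trailing tab.
join_blocks() {
    awk '
    /^StartPattern/ { print; inblk = 1; sep = ""; next }         # marker line: open a block
    /^EndPattern/   { if (sep != "") print ""                    # finish the joined row, if any
                      inblk = 0; sep = ""; print; next }
    inblk           { printf "%s%s", sep, $0; sep = "\t"; next } # tab goes before every item but the first
    { print }                                                    # outside a block: pass through
    '
}

printf '%s\n' StartPattern1 AAA BBB CCC EndPattern 'Some text' \
    StartPattern2 XXX YYY ZZZ MMM NNN EndPattern2 | join_blocks
```

Lines outside a block (such as "Some text") are passed through unchanged here; deleting the last rule drops them, which matches the output shown in the answers.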

Assigning ID to each line by column value in awk

I have a tabular file in which one column has recurring values, for example:
toto tata AFG
fff ddd AFG
ff hhh AWM
qqq ttt AWM
I would like to have an output like
toto tata AFG 1
fff ddd AFG 1
ff hhh AWM 2
qqq ttt AWM 2
by comparing each line to the next one using the 3rd column.
Is it possible to do this quickly with awk?
Thanks for your help.
awk '$3 != current {id++; current=$3} {print $0, id}'
Put input in a file.
$> cat ./text
toto tata AFG
fff ddd AFG
ff hhh AWM
qqq ttt AWM
For each line we remember the value of $3 and check whether it equals the value from the previous line. If it does not, we increment the iterator.
awk '
BEGIN {
prevValue = "";
value = "";
iterator = 0;
}
{
prevValue = value;
value = $3;
if (value != prevValue)
iterator++;
printf "%s %s\n", $0, iterator
}' ./text
So what we get is this
toto tata AFG 1
fff ddd AFG 1
ff hhh AWM 2
qqq ttt AWM 2
UPD:
As Jonathan Leffler pointed out, the BEGIN block is not really necessary here. So another workable solution is:
awk '
{
prevValue = value
value = $3
if (value != prevValue)
iterator++
print $0, iterator
}' ./text
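The solutions above compare each line with the previous one, so they rely on equal values being adjacent. If the same value can reappear later in the file and should get its original ID back, an array keyed on the column is one way to do it. A small sketch under that assumption (assign_ids is a made-up name):

```shell
# assign_ids: hypothetical helper. Gives every distinct 3rd-column value an
# ID in order of first appearance, even when equal values are not adjacent.
assign_ids() {
    awk '!($3 in id) { id[$3] = ++n } { print $0, id[$3] }'
}

printf '%s\n' 'toto tata AFG' 'ff hhh AWM' 'fff ddd AFG' | assign_ids
```

For this input the two AFG lines should both get ID 1 and the AWM line ID 2. On input that is already grouped, as in the question, the result is the same as with the previous-line comparison.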

Using awk for a table lookup

I would like to use awk to lookup a value from a text file. The text file has a very simple format:
text \t value
text \t value
text \t value
...
I want to pass the actual text for which the value should be looked up via a shell variable, e.g., $1.
Any ideas how I can do this with awk?
Your help is greatly appreciated.
All the best,
Alberto
You can do this in a pure AWK script without a shell wrapper:
#!/usr/bin/awk -f
BEGIN { key = ARGV[1]; ARGV[1]="" }
$1 == key { print $2 }
Call it like this:
./lookup.awk keyval lookupfile
Example:
$ cat lookupfile
aaa 111
bbb 222
ccc 333
ddd 444
zzz 999
mmm 888
$ ./lookup.awk ddd lookupfile
444
$ ./lookup.awk zzz lookupfile
999
This could even be extended to select the desired field using an argument.
#!/usr/bin/awk -f
BEGIN { key = ARGV[1]; field = ARGV[2]; ARGV[1]=ARGV[2]="" }
$1 == key { print $field }
Example:
$ cat lookupfile2
aaa 111 abc
bbb 222 def
ccc 333 ghi
ddd 444 jkl
zzz 999 mno
mmm 888 pqr
$ ./lookupf.awk mmm 1 lookupfile2
mmm
$ ./lookupf.awk mmm 2 lookupfile2
888
$ ./lookupf.awk mmm 3 lookupfile2
pqr
Something like this would do the job:
#!/bin/sh
awk -v LOOKUPVAL="$1" '$1 == LOOKUPVAL { print $2 }' < inputFile
Essentially you set the lookup value passed into the shell script in $1 to an awk variable, then you can access that within awk itself. To clarify, the first $1 is the shell script argument passed in on the command line, the second $1 (and subsequent $2) are fields 1 and 2 of the input file.
TEXT=`grep value file | cut -f1`
I think grep might actually be a better fit:
$ echo "key value
ambiguous correct
wrong ambiguous" | grep '^ambiguous ' | awk ' { print $2 } '
The ^ on the pattern is to match to the start of the line and ensure that you don't match a line where the value, rather than the key, was the desired text.
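Since the question describes the file as tab-separated, keys may contain spaces, which the default whitespace splitting would break up. A sketch that splits on tabs only; the function name lookup and the sample data are made up, and it assumes exactly one tab between text and value:

```shell
# lookup: hypothetical helper name. Assumes FILE holds "text<TAB>value" records
# with exactly one tab between the two columns.
lookup() {   # usage: lookup KEY FILE
    awk -F '\t' -v key="$1" '
    ($1 "") == key { print $2; exit }   # "" forces a string comparison; stop at first hit
    ' "$2"
}

# Illustrative data only.
tmp=$(mktemp)
printf 'new york\t10\nrio\t20\n' > "$tmp"
lookup 'new york' "$tmp"
rm -f "$tmp"
```

One caveat: awk processes backslash escapes in values passed with -v, so a key that contains backslashes would need different handling (for example passing it through the environment and reading ENVIRON).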