Printing lines with duplicate words - awk

I am trying to print every line that contains the same word two or more times.
E.g. with this input file:
cat dog cat
dog cat deer
apple peanut banana apple
car bus train plane
car train car train
Output should be
cat dog cat
apple peanut banana apple
car train car train
I have tried this code and it works but I think there must be a shorter way.
awk '{ a=0;for(i=1;i<=NF;i++){for(j=i+1;j<=NF;j++){if($i==$j)a=1} } if( a==1 ) print $0}'
Later I want to find all such duplicate words and delete all the duplicate entries except for 1st occurrence.
So input:
cat dog cat lion cat
dog cat deer
apple peanut banana apple
car bus train plane
car train car train
Desired output:
cat dog lion
dog cat deer
apple peanut banana
car bus train plane
car train

You can use this GNU sed command:
sed -rn '/(\b\w+\b).*\b\1\b/ p' yourfile
-r activates extended regular expressions and -n suppresses the implicit printing of every line.
The p command then prints only lines that match the preceding regex (inside the slashes):
\b\w+\b is a word: a nonempty sequence of word characters (\w) between word boundaries (\b); these are GNU extensions.
Such a word is "stored" in \1 for later reuse, because of the parentheses.
Then we try to match the same word again with \b\1\b, allowing anything (.*) between the two occurrences.
That is the whole trick: match something, put it in parentheses, and reuse it in the same regex with \1.
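As a quick check, the command can be run against the sample input from the question. This is only a sketch: it assumes GNU sed (for -r, \b and \w), and the function name dup_lines is made up for illustration.

```shell
# dup_lines is a hypothetical helper name; it needs GNU sed for -r, \b and \w.
dup_lines() {
  sed -rn '/(\b\w+\b).*\b\1\b/ p' "$@"
}

# Feed the question's sample input through the filter.
printf '%s\n' \
  'cat dog cat' \
  'dog cat deer' \
  'apple peanut banana apple' \
  'car bus train plane' \
  'car train car train' | dup_lines
```

With that input, only the first, third and fifth lines should come out.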
To answer the second part of the question (deleting the doubled words after the first, while still printing every line and modifying only the lines with doubled words), you could use some sed s magic:
sed -r ':A s/(.*)(\b\w+\b)(.*)\b\2\b(.*)/\1\2\3\4/g; t A ;'
Here we use the backreference trick again.
But we have to account for the text before, between, and after the doubled words, so the matching part of the s command contains \2 and the replacement part uses the other backreferences.
Notice that only the \2 has no parentheses in the matching part, while the replacement uses all four groups, so we have effectively deleted the second word of the pair.
For more repetitions of the word we need a loop:
:A is a label.
t A jumps to the label if the last s command made a replacement.
This builds a "while loop" around the s command to delete the remaining repetitions, too.

Here's a solution for printing only lines that contain duplicate words.
awk '{
delete seen
for (i=1;i<=NF;++i) {
if (seen[$i]) { print ; next }
seen[$i] = 1
}
}'
Here's a solution for deleting duplicate words after the first.
awk '{
delete seen
for (i=1;i<=NF;++i) {
if (seen[$i]) { continue }
printf("%s ", $i);
seen[$i] = 1
}
print "";
}'
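Note that the printf("%s ", $i) version above leaves a trailing space on every line. A minimal sketch that builds the line first and so avoids it; the function name dedup_words is hypothetical, and split("", seen) is used instead of delete seen only because it clears the array in any POSIX awk:

```shell
# dedup_words is a hypothetical name; it keeps the first occurrence of each word.
dedup_words() {
  awk '{
    split("", seen)                 # clear the per-line lookup table
    out = ""
    for (i = 1; i <= NF; ++i) {
      if ($i in seen) continue      # word already printed on this line
      seen[$i] = 1
      out = (out == "" ? "" : out " ") $i
    }
    print out
  }' "$@"
}

printf '%s\n' 'cat dog cat lion cat' 'car train car train' | dedup_words
```

For the two sample lines this should print "cat dog lion" and "car train", with no trailing space.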
Re your comment...
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. — Jamie Zawinski, 1997

With egrep (GNU grep -E) you can use a so-called backreference:
egrep '(\b\w+\b).*\b\1\b' file
(\b\w+\b) matches a word at word boundaries in capturing group 1. \1 references that matched word in the pattern.

I'll show solutions in Perl as it is probably the most flexible tool for text parsing, especially when it comes to regular expressions.
Detecting Duplicates
perl -ne 'print if m{\b(\S+)\b.*?(\b\1\b)}g' file
where
-n causes Perl to execute the expression passed via -e for each input line;
\b matches word boundaries;
\S+ matches one or more non-space characters;
.*? is a non-greedy match for zero or more characters;
\1 is a backreference to the first group, i.e. the word \S+;
g globally matches the pattern repeatedly in the string.
Removing Duplicates
perl -pe '1 while (s/\b(\S+)\b.*?\K(\s\1\b)//g)' file
where
-p causes Perl to print the line ($_), like sed;
1 while loop runs as long as the substitution replaces something;
\K keeps the part matching the previous expression;
Duplicate words (\s\1\b) are replaced with empty string (//g).
Why Perl?
Perl regular expressions are known to be very flexible, and regular expressions in Perl are actually more than just regular expressions. For example, you can embed Perl code into the substitution using the /e modifier. The /x modifier lets you write regular expressions in a more readable format and even use Perl comments in them, e.g.:
perl -pe '1 while (
s/ # Begins substitution: s/pattern/replacement/flags
\b (\S+) \b # A word
.*? # Ungreedy pattern for any number of characters
\K # Keep everything that matched the previous patterns
( # Group for the duplicate word:
\s # - space
\1 # - backreference to the word
\b # - word boundary
)
//xg
)' file
As you should have noticed, the \K anchor is very convenient, but is not available in many popular tools including awk, bash, and sed.

Related

extracting a specific word between : using sed, awk or grep

I have a file that has the following contents and many more.
#set_property board_part my.biz:ab0860_1cf:part0:1.0 [current_project]
set_property board_part my.biz:ab0820_1ab:part0:1.0 [current_project]
My ideal output is shown below (i.e., the text between the first ":" and the second ":").
ab0820_1ab
I generally use python and use regular expression along the lines of below to get the result.
\s*set_property board_part trenz.biz:([a-zA-Z_0-9]+)
I would like to know how it can be done quickly and in a more generic way using command-line tools (sed, awk).
You might use GNU sed in the following way. Let file.txt contain
#set_property board_part my.biz:ab0860_1cf:part0:1.0 [current_project]
set_property board_part my.biz:ab0820_1ab:part0:1.0 [current_project]
garbage garbage garbage
then
sed -n '/ set_property board_part my.biz/ s/[^:]*:\([^:]*\):.*/\1/ p' file.txt
gives output
ab0820_1ab
Explanation: -n turns off default printing. / set_property board_part my.biz/ is a so-called address: the commands that follow are applied only to lines matching it. Note the leading space in the address: it requires whitespace before set_property, which is what excludes the commented-out #set_property line; if your lines are not indented, use /^set_property board_part my.biz/ instead. The first command is a substitution (s) that uses a capturing group, denoted by \( and \). The regular expression reads as follows: zero or more non-: characters (i.e. everything before the 1st :), a :, zero or more non-: characters (i.e. everything between the 1st and 2nd :) enclosed in the capturing group, another :, and zero or more of any character (i.e. everything after the 2nd :). The whole match is replaced by the content of the 1st (and, in this case, only) capturing group. After the substitution takes place, the p command tells GNU sed to print the changed line.
(tested in GNU sed 4.2.2)
Your example data has my.biz but your pattern tries to match trenz.biz
If GNU awk is available, you can use a capture group and then print its value, which is available in a[1]
awk 'match($0, /^\s*set_property board_part \w+\.biz:(\w+)/, a) {print a[1]}' file
The pattern matches:
^ Start of string
\s* Match optional whitespace chars
set_property board_part Match literally
\w+\.biz: Match 1+ word chars followed by .biz (note to escape the dot to match it literally)
(\w+) Capture group 1, match 1+ word chars
Notes
If you just want to match trenz.biz then you can replace \w+\.biz with trenz\.biz
If the strings are not at the start of the string, you can change ^ for \s to match a whitespace char instead
Output
ab0820_1ab
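If GNU awk is not available, a rough portable sketch is to let awk split each line on ":" and print the second field. The function name board_part is made up for illustration, and the sketch assumes the wanted value is always the text between the first and second colon:

```shell
# board_part is a hypothetical helper; it works in any POSIX awk.
board_part() {
  # Only uncommented lines (optional leading blanks, then set_property).
  awk -F: '/^[ \t]*set_property board_part / { print $2 }' "$@"
}

printf '%s\n' \
  '#set_property board_part my.biz:ab0860_1cf:part0:1.0 [current_project]' \
  'set_property board_part my.biz:ab0820_1ab:part0:1.0 [current_project]' | board_part
```

The commented-out line is skipped because the pattern is anchored at the start of the line, so only ab0820_1ab should be printed.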

sed: cut out a substring following various patterns

There is a list of identifiers I want to modify:
3300000526.a:P_A23_Liq_2_FmtDRAFT_1000944_2,
200254578.a:CR_10_Liq_3_inCRDRAFT_100545_11,
3300000110.a:BSg2DRAFT_c10006505_1,
3300000062.a:IMNBL1DRAFT_c0010786_1,
3300000558.a:Draft_10335283_1
I want to remove everything from the first . up to (but not including) the first _ after DRAFT (case-insensitive), i.e.:
3300000526_1000944_2,
200254578_100545_11,
3300000110_c10006505_1,
3300000062_c0010786_1,
3300000558_10335283_1
I am using sed 's/.a.*[a-zA-Z0-9]DRAFT_.*[^_]_[a-zA-Z0-9]//g' but it ignores the first _ after DRAFT and does this:
3300000526_2,
200254578_11,
3300000110_1,
3300000062_1,
3300000558_1
P.S.
There can be various identifiers and I tried to show a little portion on their variance here, but they all keep same pattern.
I'd be grateful for corrections!
You could easily do this in awk; please try the following. It is based on the shown samples only.
awk -F'[.]|DRAFT_' '{$2="_";sub(/ +_ +/,"_")} 1' Input_file
OR with GNU awk for handling case-insensitive try:
awk -v IGNORECASE="1" -F'[.]|DRAFT_' '{$2="_";sub(/ +_ +/,"_")} 1' Input_file
To handle case insensitive without ignorecase option try:
awk -F'[.]|[dD][rR][aA][fF][tT]_' '{$2="_";sub(/ +_ +/,"_")} 1' Input_file
Explanation: The field separator is set to . OR DRAFT_, as the OP needs. The main program then sets the 2nd field to _ and replaces "spaces, underscore, spaces" with a single _. Finally, the trailing 1 prints the current line.
A workable solution
You can use:
sed 's/[.].*[dD][rR][aA][fF][tT]_/_/' data
You could also use \. in place of [.] but I'm allergic to unnecessary backslashes — you might be too if you'd had to spend time fighting whether 8 or 16 consecutive backslashes was the correct way to write to document markup (using troff).
For your sample data, it produces:
3300000526_1000944_2,
200254578_100545_11,
3300000110_c10006505_1,
3300000062_c0010786_1,
3300000558_10335283_1
What went wrong
Your command is:
sed 's/.a.*[a-zA-Z0-9]DRAFT_.*[^_]_[a-zA-Z0-9]//g'
This matches:
any character (the leading .)
lower-case 'a'
any sequence of characters
an alphanumeric character
upper-case only DRAFT
underscore
any sequence of characters
a character other than an underscore
underscore
an alphanumeric character
doing the match globally on each line
It then deletes all the matched material. You could rescue it by using:
sed 's/[.]a.*[a-zA-Z0-9]DRAFT\(_.*[^_]_[a-zA-Z0-9]\)/\1/'
This matches a dot rather than any character, and saves the material after DRAFT starting with the underscore (that's the \(…\)), replacing what was matched with what was saved (that's the \1). You can convert DRAFT to the case-insensitive pattern too, of course. However, the terms of the question refer to "from the first dot (.) up to (but not including) the underscore after a (case-insensitive) DRAFT", and detailing, saving, and replacing the material after the underscore is not necessary.
Laziness
I saved myself typing the elaborate case-insensitive string by using a program called mkpattern that (according to RCS) I created on 1990-08-23 (a while ago now). It's not rocket science. I use it every so often — I've actually used it a number of times in the last week, searching for some elusive documents on a system at work.
$ mkpattern DRAFT
[dD][rR][aA][fF][tT]
$
You might have to do that longhand in future.
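Try something like
{mawk/mawk2/gawk} 'BEGIN { FS = "[\056].+DRAFT_"; OFS = "_"; } (NF < 2) || ($1 = $1)'
It might not be the fastest, but it's a relatively clean approach. Octal \056 is the period, and it's less ambiguous to the reader when the next item is a ".+". OFS is "_" because the separator consumes the underscore after DRAFT, so it has to be put back when the fields are rejoined.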
This might work for you (GNU sed):
sed -nE 's/DRAFT[^_]*/\n/i;s/\..*\n//p' file
First turn on the -n and -E to turn off implicit printing and make regexp more easily seen.
Since we want the first occurrence of DRAFT we can not use a regexp that uses the .* idiom as this is greedy and may pass over it if there are two or more such occurrences. Thus we replace the DRAFT by a unique character which cannot occur in the line. The newline can only be introduced by the programmer and is the best choice.
Now we can use the .* idiom as the introduced newline can only exist if the previous substitution has matched successfully.
N.B. The i flag in the first substitution allows for any upper/lower case rendition of the string DRAFT, also the second substitution includes the p flag to print the successful substitution.
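The same "first dot, then first DRAFT_" logic can also be sketched without regular expressions, which sidesteps the greediness problem altogether. This is an illustrative alternative, not the method above: strip_draft is a hypothetical name, and it assumes DRAFT is always followed directly by an underscore, as in the sample data.

```shell
# strip_draft is a hypothetical helper; it uses only POSIX awk string functions.
strip_draft() {
  awk '{
    dot = index($0, ".")                  # position of the first "."
    d   = index(tolower($0), "draft_")    # first DRAFT_ in any letter case
    if (dot && d > dot)
      print substr($0, 1, dot - 1) "_" substr($0, d + 6)
    else
      print                               # leave other lines untouched
  }' "$@"
}

printf '%s\n' \
  '3300000526.a:P_A23_Liq_2_FmtDRAFT_1000944_2,' \
  '3300000558.a:Draft_10335283_1' | strip_draft
```

For these two identifiers the output should be 3300000526_1000944_2, and 3300000558_10335283_1.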

Mining dictionary for sed search strings

For fun I was mining the dictionary for words that sed could use to modify strings. Example:
sed settee <<< better
sed statement <<< dated
Outputs:
beer
demented
These sed swords must be at least 5 letters long, and begin with s, then another letter, which can appear only 3 times, with at least one other letter between the first and second instances, and with the third instance as the final letter.
I used sed to generate a word list, and it seems to work:
d=/usr/share/dict/american-english
sed -n '/^s\([a-z]\)\(.*\1\)\{2\}$/{
/^s\([a-z]\)\(.*\1\)\{3\}$/!{/^s\([a-z]\)\1/!p}}' $d |
xargs echo
Output:
sanatoria sanitaria sarcomata savanna secede secrete secretive segregate selective selvedge sentence sentience sentimentalize septette sequence serenade serene serpentine serviceable serviette settee severance severe sewerage sextette stateliest statement stealthiest stoutest straightest straightjacket straitjacket strategist streetlight stretchiest strictest structuralist
But that sed code runs three passes through each line, which seems excessively long and kludgy. How can that code be simplified, while still outputting the same word list?
grep or awk answers would also be OK.
awk to the rescue!
code is cleaner with awk and reads as the spec: split the word based on the second char, three instances of the char will split the word into 4 segments; 2nd one should have at least one char and the last one should be empty.
$ awk '/^s/{n=split($1,a,substr($1,2,1));
if(n==4 && length(a[2])>0 && a[4]=="") print}' /usr/share/dict/american-english | xargs
sanatoria sanitaria sarcomata savanna secede secrete secretive
segregate selective selvedge sentence sentience sentimentalize
septette sequence serenade serene serpentine serviceable serviette
settee severance severe sewerage sextette stateliest statement
stealthiest stoutest straightest straightjacket straitjacket strategist
streetlight stretchiest strictest structuralist
very cool idea. I think you're more restrictive than necessary
sed -nE '/^s(.)[^\1]+\1[^\1]*\1g?$/p'
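seems to work fine. It generated 518 words for me, though I only have the /usr/share/dict/words dictionary file. Note that \1 is not a backreference inside a bracket expression: [^\1] matches any character other than a backslash or the digit 1, so this pattern does not limit the letter to three occurrences.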
sabadilla sabakha sabana sabbatia sabdariffa sacatra saccharilla
saccharogalactorrhea saccharorrhea saccharosuria saccharuria sacralgia
sacraria sacrcraria sacrocoxalgia sadhaka sadhana sahara saintpaulia
salaceta salada salagrama salamandra saltarella salutatoria
...
stuntist subbureau sucuriu sucuruju sulphurou surucucu
syenite-porphyry symphyseotomy symphysiotomy symphysotomy symphysy
symphytically syndactyly synonymity synonymously synonymy
syzygetically syzygy
an interesting find is
$ sed snow-nodding <<< now-or-never
noddior-never
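To see how this looser pattern differs from the original filter, the two can be compared on a couple of words. In this sketch, loose and strict are hypothetical names; strict simply reuses the awk split test shown earlier in the thread.

```shell
# loose: the sed pattern above; [^\1] here is a plain bracket expression.
loose() {
  sed -nE '/^s(.)[^\1]+\1[^\1]*\1g?$/p' "$@"
}

# strict: the awk split test, which requires exactly three occurrences.
strict() {
  awk '/^s/ { n = split($1, a, substr($1, 2, 1))
              if (n == 4 && length(a[2]) > 0 && a[4] == "") print }' "$@"
}

printf '%s\n' settee saccharogalactorrhea | loose
printf '%s\n' settee saccharogalactorrhea | strict
```

loose should print both words, while strict should print only settee, because saccharogalactorrhea contains the letter a more than three times.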
A speedy pcregrep method, (.025 seconds user time):
d=/usr/share/dict/american-english
pcregrep '^s(.)((?!\1).)+\1((?!\1).)*\1$' $d | xargs echo
Output:
sanatoria sanitaria sarcomata savanna secede secrete secretive segregate selective selvedge sentence sentience sentimentalize septette sequence serenade serene serpentine serviceable serviette settee severance severe sewerage sextette stateliest statement stealthiest stoutest straightest straightjacket straitjacket strategist streetlight stretchiest strictest structuralist
Code inspired by: Regex: Match everything except backreference

awk statement within sed

I have multiple occurrences of the pattern:
)0.[0-9][0-9][0-9]:
where [0-9] is any digit, in various text context but the pattern is unique as this regex. And I need to turn the decimal fraction into integer (percent values from 0 to 99).
A small example substring would be
=1:0.00055)0.944:0.02762)0.760:0
to turn into
=1:0.00055)94:0.02762)76:0
What I’m doing is :
cat file | sed -e "s/)\([0-9].[0-9][0-9][0-9]\):/)`echo "\1"|awk '{ r=int(100*$0); if((r>=0)&&(r<=100)){ print r; } else { print "error"; exit(-1); } }'`:/g"
but the output is )0:
where is the fault?...
Since you asked 'where is the fault' and not 'how to solve the problem':
Your backquoted pipeline echo ...|awk ... is executed FIRST, producing a single 0 which is then made part of the s/// command passed to sed and thus substituted everywhere the pattern matches. PS: using the newer (post-Reagan) and more flexible notation for command substitution $( ... ) instead of backquotes is preferred in all shells except csh family, and especially on Stack where backquotes are special to markdown and troublesome to show in text.
If you want to actually solve the problem, which you didn't describe clearly or completely, some pointers toward a better direction:
Standard sed can't execute a command to generate replacement text; GNU sed can with flag e but you need to make the whole patternspace the command and fiddle anything else into holdspace, which is tedious. perl can evaluate an expression in the replacement for s, including arithmetic; awk (even gawk) can't do so directly, but you can get the same effect by doing the match and the replace/rebuild as separate steps, depending on the unspecified and unclear details of exactly what you want to do; if you want to keep the rest of the line unchanged, something like:
awk 'match($0,/)0[.][0-9][0-9][0-9]:/){ print substr($0,1,RSTART) (substr($0,RSTART+1,RLENGTH-2)*100) substr($0,RSTART+RLENGTH-1) }'
But you don't actually need arithmetic here if you're satisfied with truncating. Just discard the leading 0. and the last digit and keep the two digits in between:
sed 's/)0[.]\([0-9][0-9]\)[0-9]:/)\1:/g'
Note . in regexp unless escaped or in a charclass (as I did) matches any character not just period, which may or may not be a problem since you didn't give the rest of your input.
And PS: negative numbers for process exit status don't work (except IIRC Plan 9). Use small (usually < 128) positive status values for errors; most common is to just use 1.
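The "match and rebuild as separate steps" idea can be sketched as an awk loop that handles every occurrence on a line and also drops a leading zero (so 0.055 becomes 5 rather than 05). This is only an illustration under the assumption that truncation is acceptable; the name to_percent is made up.

```shell
# to_percent is a hypothetical helper; it rewrites every ")0.ddd:" as ")N:".
to_percent() {
  awk '{
    out = ""
    while (match($0, /\)0[.][0-9][0-9][0-9]:/)) {
      pct = substr($0, RSTART + 3, 2) + 0        # two digits after "0.", as a number
      out = out substr($0, 1, RSTART) pct ":"    # text up to and including ")"
      $0  = substr($0, RSTART + RLENGTH)         # continue after the match
    }
    print out $0
  }' "$@"
}

printf '%s\n' '=1:0.00055)0.944:0.02762)0.760:0' | to_percent
```

For the sample substring this should give =1:0.00055)94:0.02762)76:0. Taking the digits as a string avoids floating-point surprises from multiplying by 100.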
Check this perl one-liner command :
perl -pe 's/\)(\d+\.\d+):/sprintf ")%d:", $1 * 100/ge' file
Before :
=1:0.00055)0.944:0.02762)0.760:0
After :
=1:0.00055)94:0.02762)76:0
If you need to replace for real in editing mode, add -i switch :
perl -i -pe '...'

Extract text between symbols with bash

First off, I am relatively new at this, so please bear with me.
I have an annotated transcriptome .fasta file, which contains ~60,000 records of genes like these two:
>comp35897_c0_seq11 len=1039 path=[11:0-12;24:13-1038] Match_Acc=E5SX33 Gene=Putative_CAP-Gly_domain_protein
TTTTAAATTGATTACTTTGCTATTTTTGGCAATGTTGGACTGAGTTGTCGTATTTTTTCG
>comp32620_c0_seq3 len=1874 path=[1:0-195;197:196-220;222:221-354;356:355-481;4197:482-487;489:488-579;581:580-1159;1161:1160-1712;1714:1713-1729;1731:1730-1794;5873:1795-1873] Match_Acc=K1PQJ1 Gene=HAUS_augmin-like_complex_subunit_3 GO=GO:0051225,GO:0070652
CAGACTTTTGGATTTAGTACATGTATGTATGAATATGTGTTTCAATGTACAACTCAGGAT
I am trying to create a two-column, space-delimited .tab with component number in the first column and gene name in the second column. I have looked at many similar posts using grep, sed, or awk, but none of the suggested code has worked for me.
Specifically, what I need to pull from the .fasta is the comp number between the > and the next space for the first column, and the gene name between Gene= and the next space. For the two genes above, that should give me:
comp35897_c0_seq11 Putative_CAP-Gly_domain_protein
comp32620_c0_seq3 HAUS_augmin-like_complex_subunit_3
Any help would be much appreciated!
with awk:
Skips the gene name if 'Gene=' is absent
awk 'BEGIN{RS=">"} NF>1{if($5 ~ /Gene=/){gsub("Gene=","",$5); print $1,$5} else {print $1}}' < transcriptome.fasta > space-delimited.tab
Output:
comp35897_c0_seq11 Putative_CAP-Gly_domain_protein
comp32620_c0_seq3
Skips the record if 'Gene=' is absent
awk 'BEGIN{RS=">"} NF>1{if($5 ~ /Gene=/){gsub("Gene=","",$5); print $1,$5}}' < transcriptome.fasta > space-delimited.tab
Output:
comp35897_c0_seq11 Putative_CAP-Gly_domain_protein
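Both commands above look for Gene= only in the fifth field. If the position of Gene= can vary between records (an assumption, since only two headers are shown), a sketch that scans every field of each header line might look like this; fasta_genes is a hypothetical name:

```shell
# fasta_genes is a hypothetical helper; it prints "id gene" for each header
# line that carries a Gene= field, wherever that field appears.
fasta_genes() {
  awk '/^>/ {
    id = substr($1, 2)                       # drop the leading ">"
    gene = ""
    for (i = 2; i <= NF; i++)
      if ($i ~ /^Gene=/) { gene = substr($i, 6); break }
    if (gene != "") print id, gene
  }' "$@"
}

# Usage with the file names from the answer above:
#   fasta_genes transcriptome.fasta > space-delimited.tab
printf '%s\n' \
  '>comp35897_c0_seq11 len=1039 path=[11:0-12;24:13-1038] Match_Acc=E5SX33 Gene=Putative_CAP-Gly_domain_protein' \
  'TTTTAAATTGATTACTTTGCTATTTTTGGCAATGTTGGACTGAGTTGTCGTATTTTTTCG' | fasta_genes
```

Sequence lines are ignored because they do not start with >, and headers without a Gene= field are skipped.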
Have you tried anything yet?
with sed you could do:
sed 's/>\(comp[^ ]\+\) \+.*Gene=\([^ ]\+\) .*$/\1 \2/'
which looks complex but is relatively easy to understand if you take it slowly and break it down into its component parts.
edit
OK, so to ensure sed only outputs what you want, you need to switch on the 'no output by default' mode with -n and explicitly print each line you are interested in with the p flag.
I'll try to break it down, so that it is understandable.
comp[^ ]\+ #is a regex that says:
#text that starts with the string 'comp'
#and is followed by at least one character
#that is anything that isn't a space (the [^ ])
\(comp[^ ]\+\) #is the sed construct that remembers what
#that regex matches.
.* #is the regex for zero or more of any chars.
'Gene=\([^ ]\+\) ' #look for the string 'Gene' followed by an
#equals sign, followed by at least one char
#that isn't a space, followed by a space
#oh, and remember the bit after = and before the space
so, along with the -n and the p switches for sed you could use:
sed -n 's/>\(comp[^ ]\+\) \+.*Gene=\([^ ]\+\) .*$/\1 \2/p'