Print the duplicate lines in a file using awk - awk

I have a requirement to print all the duplicated lines in a file, on a system where uniq does not support the -D option. So I am thinking of an alternative way to print the duplicate lines using awk. I know that we have an option in awk like below.
testfile.txt
apple
apple
orange
orange
cherry
cherry
kiwi
strawberry
strawberry
papaya
cashew
cashew
pista
The command:
awk 'seen[$0]++' testfile.txt
But the above prints each duplicated line only once (it skips the first occurrence). I need the same output that the uniq -D command produces, like this:
apple
apple
orange
orange
cherry
cherry
strawberry
strawberry
cashew
cashew

No need to parse the file twice:
$ awk 'c[$0]++; c[$0]==2' file
apple
apple
orange
orange
cherry
cherry
strawberry
strawberry
cashew
cashew
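Unlike uniq -D, this one-pass awk does not require the duplicates to be adjacent. A quick check on an unsorted sample (a sketch assuming any POSIX awk):

```shell
# Duplicates are deliberately non-adjacent here; uniq would miss them
# without a prior sort, but the awk counter keys on the whole line.
out=$(printf 'apple\norange\napple\norange\nkiwi\n' \
  | awk 'c[$0]++; c[$0]==2')
echo "$out"
```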

If you want to stick with just plain awk, you'll have to process the file twice: once to generate the counts, and once to print the lines whose count is greater than 1:
awk 'NR==FNR {count[$0]++; next} count[$0]>1' testfile.txt testfile.txt
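For completeness, a self-contained run of the two-pass version (assumes mktemp and a POSIX awk):

```shell
# First pass (NR==FNR) counts occurrences; second pass prints every
# line whose count exceeds 1. The same file is passed twice.
tmp=$(mktemp)
printf 'apple\napple\nkiwi\nstrawberry\nstrawberry\n' > "$tmp"
dups=$(awk 'NR==FNR {count[$0]++; next} count[$0]>1' "$tmp" "$tmp")
rm -f "$tmp"
echo "$dups"
```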

With sed:
$ sed 'N;/^\(.*\)\n\1$/p;$d;D' testfile.txt
apple
apple
orange
orange
cherry
cherry
strawberry
strawberry
cashew
cashew
This does the following:
N # Append next line to pattern space
/^\(.*\)\n\1$/p # Print if lines in pattern space are identical
$d # Avoid printing lone non-duplicate last line
D # Delete first line in pattern space
There are a few limitations:
It only works for contiguous duplicates, i.e., not for
apple
orange
apple
Lines appearing more than twice in a row throw it off.

This might work for you (GNU sed):
sed -rn ':a;N;/^([^\n]*)\n\1$/p;//ba;/^([^\n]*)(\n\1)+$/P;//ba;s/.*\n//;ba' file
Read two lines into the pattern space (PS). If the first two lines are duplicate, print them and loop back and read a third line. If the third or subsequent lines are duplicate, print the first and loop back and read another line. Otherwise, remove all but the last line and loop back and read another etc.

Something like this, if uniq supports -d?
grep -f <(uniq -d testfile.txt ) testfile.txt
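One caveat (my note, not part of the answer above): grep -f treats each pattern as a regex that may match anywhere in a line, so a duplicated line such as pi would also drag in pista. Adding -F (fixed strings) and -x (whole-line match) avoids that:

```shell
tmp=$(mktemp); pat=$(mktemp)
printf 'pi\npi\npista\n' > "$tmp"
uniq -d "$tmp" > "$pat"             # pattern file contains just "pi"
loose=$(grep -f "$pat" "$tmp")      # substring regex match: also hits "pista"
strict=$(grep -Fxf "$pat" "$tmp")   # fixed-string, whole-line match
rm -f "$tmp" "$pat"
echo "$strict"
```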

awk '{ if ($0 in seen) { dup[$0]++; print; if (dup[$0] == 1) print seen[$0] } seen[$0] = $0 }' testfile.txt

You can do:
$ uniq -d file | awk '1;1'
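A caveat worth flagging (my note): awk '1;1' prints every input line exactly twice, so this matches uniq -D only when each duplicated line occurs exactly twice in the file:

```shell
tmp=$(mktemp)
printf 'a\na\na\nb\n' > "$tmp"        # "a" occurs three times
doubled=$(uniq -d "$tmp" | awk '1;1')
rm -f "$tmp"
echo "$doubled"                       # two copies of "a"; uniq -D would print three
```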

Related

How do I extract specific lines based on a comparison of two files with sed and/or awk?

I need to extract all the lines from file2.h that do not match the string up to the first dot in any line of file1.h. I am interested in a solution that stays as close to my current approach as possible so it is easy for me to understand, and uses only sed and/or awk in linux bash.
file1.h
apple.sweet banana
apple.tasty banana
apple.brown banana
orange_mvp.rainy day.here
orange_mvp.ear nose.png
lemon_mvp.ear ring
tarte_mvp_rainy day.here
file2.h
orange_mvp
lemon_mvp
lemon_mvp
tarte_mvp
cake_mvp
result desired
tarte_mvp
cake_mvp
current wrong approach
$ awk '
NR==FNR { sub(/mvp(\..*)$/,""); a[$0]; next }
{ f=$0; sub(/mvp(\..*)$/,"", f) }
!(f in a)
' file2.h file1.h
apple.sweet banana
apple.tasty banana
apple.brown banana
orange_mvp.rainy day.here
orange_mvp.ear nose.png
lemon_mvp.ear ring
tarte_mvp_rainy day.here
Using awk
$ awk -F. 'NR==FNR {a[$1]=$1;next} a[$1] != $0' file1.h file2.h
tarte_mvp
cake_mvp
The answer by @HatLess is very nice and idiomatic. If you find it a bit cryptic, you can also consider this one, in program.awk:
BEGIN {
    while ((getline prefix < "file1.h") > 0) {
        gsub("[.].*", "", prefix)
        ignore[prefix]
    }
}
!($0 in ignore) {
print($0)
}
Invoked with awk -f program.awk file2.h.
In the BEGIN block we read all the lines from file1.h and store the prefixes we want to ignore as keys of a hash table.
Then we process file2.h and print all the lines which are selected (not ignored).
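A self-contained run of the same idea, with made-up file names and the getline return value checked so a missing file cannot loop forever (a hardened sketch, not the original answer verbatim):

```shell
cat > ignore.awk <<'EOF'
BEGIN {
    # Read every prefix line, strip from the first dot, remember it as a key.
    while ((getline prefix < "prefixes.txt") > 0) {
        gsub("[.].*", "", prefix)
        ignore[prefix]
    }
}
!($0 in ignore)
EOF
printf 'orange_mvp.rainy day\nlemon_mvp.ear ring\n' > prefixes.txt
printf 'orange_mvp\nlemon_mvp\ncake_mvp\n' > candidates.txt
kept=$(awk -f ignore.awk candidates.txt)
rm -f ignore.awk prefixes.txt candidates.txt
echo "$kept"
```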

Is there a simple awk/sed way to print list in YAML file?

I'm looking for an optimized way to print a particular list in a YAML file using sed or/and awk.
For example, in the below sample yaml file, how do I get the list of Fruits alone printed on screen,say, comma separated?
Input File: boston_mart.yaml
What I am able to achieve using awk is to print after "Fruits:", but I also need another condition: print only if "-" is in front of the word. That is where I am stuck. Any help or pointers will be very helpful.
## YAML
Market: open
Season: fall
Fruits:
- apple
- orange
- banana
- grapes
Vendors: 7
Buyers: 5
Vegetables:
- tomato
- carrot
- broccoli
Location: Boston
Output
apple
orange
banana
grapes
$ awk '$1 == "-"{ if (key == "Fruits:") print $NF; next } {key=$1}' file
apple
orange
banana
grapes
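Since the question also mentioned a comma-separated variant, one option is to pipe the same awk through paste (a sketch; paste -sd, joins all lines with commas):

```shell
# Reduced sample YAML; the awk tracks the last key seen and only
# prints "-" items while that key is "Fruits:".
fruits=$(printf 'Market: open\nFruits:\n- apple\n- orange\nVendors: 7\nVegetables:\n- tomato\n' \
  | awk '$1 == "-" { if (key == "Fruits:") print $NF; next } { key = $1 }' \
  | paste -sd, -)
echo "$fruits"
```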
sed -n '/^Fruits:/,/^[^-]/{//b; s/^- //p}' -- data
where:
-n          by default, do not echo input lines
/…/,/…/     select the range of lines from "Fruits:" up to the first line not starting with "-" (incl. EOF)
//b         skip the line if the last regex matched (i.e. the range boundaries)
s/^- //p    otherwise strip the leading dash and print the line
Output:
apple
orange
banana
grapes
This might work for you (GNU sed):
sed -En '/:/h;G;s/^- (.*)\nFruits:/\1/p' file
Make a copy of the key.
Append the copy to each line.
If a line starts with "- " and the appended key is "Fruits:", print its value.
awk '$1~/:/{key=$1} key~/Fruits:/&&$1=="-"{print $2} ' file
Output:
apple
orange
banana
grapes

How to remove all lines after a line containing some string?

I need to remove all lines after the first line that contains the word "fox".
For example, for the following input file "in.txt":
The quick
brown fox jumps
over
the
lazy dog
the second fox
is hunting
The result will be:
The quick
brown fox jumps
I would prefer an awk or sed script, but any other command-line tool is fine too, like perl, php or python.
I am using gnuwin32 tools in Windows, and the solution I could find was this one:
grep -m1 -n fox in.txt | cut -f1 -d":" > var.txt
set /p MyNumber=<var.txt
head -%MyNumber% in.txt > out.txt
However, I am looking for a solution that is shorter and portable (this one contains the Windows-specific command set /p).
Similar questions:
How to delete all the lines after the last occurrence of pattern?
How to delete lines before a match, preserving it?
Remove all lines before a match with sed
How to delete the lines starting from the 1st line till line before encountering the pattern '[ERROR] -17-12-2015' using sed?
How to delete all lines before the first and after the last occurrence of a string?
awk '{print} /fox/{exit}' file
With GNU sed:
sed '0,/fox/!d' file
or
sed -n '0,/fox/p' file
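A quick check of the awk form on a reduced sample (note that the two sed versions rely on GNU sed's 0,/re/ address form):

```shell
# Print every line, then bail out as soon as the first /fox/ match
# has been printed.
truncated=$(printf 'The quick\nbrown fox jumps\nover\nthe second fox\n' \
  | awk '{print} /fox/{exit}')
echo "$truncated"
```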

In a file with two word lines, grep only those lines which have both words from a whitelist

I have a file1:
green
yellow
apple
mango
and a file2:
red apple
blue banana
yellow mango
purple cabbage
I need to find elements from file2 where both words belong to the list in file1. So it should show:
yellow mango
I tried:
awk < file2 '{if [grep -q $1 file1] && [grep -q $2 file1]; then print $0; fi}'
I am getting syntax error.
This will do the trick:
$ awk 'NR==FNR{a[$0];next}($1 in a)&&($2 in a)' file1 file2
yellow mango
Explanation:
NR is a special awk variable that tracks the current line number across all input, and FNR tracks the current line number within each individual file, so the condition NR==FNR is only true while we are reading the first file. a is an associative array whose keys are the unique lines of the first file. $0 is the current line. The next statement jumps to the next input line, so the rest of the script is skipped for lines of the first file. The final part is straightforward: if the first field $1 is in the array a and the second field $2 is also in a, print the current line. The default action block in awk is {print $0}, so it is implicit here.
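The steps above can be checked end to end (a sketch using temp files):

```shell
f1=$(mktemp); f2=$(mktemp)
printf 'green\nyellow\napple\nmango\n' > "$f1"          # whitelist
printf 'red apple\nyellow mango\nblue banana\n' > "$f2" # candidates
# Keep only lines of the second file where both words are whitelist keys.
both=$(awk 'NR==FNR{a[$0];next}($1 in a)&&($2 in a)' "$f1" "$f2")
rm -f "$f1" "$f2"
echo "$both"
```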
This is a very hackish approach and probably frowned upon by many of the grep/sed implementors. In addition it is probably terminal dependent. You have been warned.
GNU grep, when in color mode, highlights the pieces of the input that were matched by one of the patterns. This can, in theory, be used as a test for a full match. Here it even works in practice, that is, with some help from GNU sed:
grep --color=always -f file1 file2 | sed -n '/^\x1b.*\x1b\[K *\x1b.*\x1b\[K$/ { s/\x1b\[K//g; s/\x1b[^m]*m//gp }'
Output:
yellow mango
Note that the sed pattern assumes space separated columns in file2.
You can do it with bash, sed and grep:
grep -f <(sed 's/^/^/' file1) file2 | grep -f <(sed 's/$/$/' file1)
This is a bit obscure, so I will break it down:
grep -f <file> reads a sequence of patterns from a file and will match on any of them.
<(...) is bash process substitution and will execute a shell command and create a pseudo-file with the output that can be used in place of a filename.
sed 's/^/^/' file1 inserts a ^ character at the start of each line in file1, turning the lines into patterns that will match the first word of file2.
sed 's/$/$/' file1 inserts a $ character at the end, so the patterns will match the second word.
Edit:
Use:
grep -f <(sed 's/^/^/;s/$/\b/' file1) file2 | grep -f <(sed 's/$/$/;s/^/\b/' file1)
to get round the issue that Jonathan pointed out in his comment.
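The same idea can be run without bash process substitution by writing the anchored pattern files explicitly (a sketch; \b is a GNU grep word-boundary extension):

```shell
w=$(mktemp); p1=$(mktemp); p2=$(mktemp)
printf 'green\nyellow\nmango\n' > "$w"
sed 's/^/^/;s/$/\\b/' "$w" > "$p1"     # ^green\b etc.: anchor to the first word
sed 's/$/$/;s/^/\\b/' "$w" > "$p2"     # \bgreen$ etc.: anchor to the second word
matched=$(printf 'red apple\nyellow mango\ngreen mango\n' \
  | grep -f "$p1" | grep -f "$p2")
rm -f "$w" "$p1" "$p2"
echo "$matched"
```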

sed script to print the first three words in each line

I wonder how can I do the following thing with sed:
I need to keep only the first three words in each line.
For example, the following text:
the quick brown fox jumps over the lazy bear
the blue lion is hungry
will be transformed in:
the quick brown
the blue lion
In awk you can say:
{print $1, $2, $3}
You can use cut like this:
cut -d' ' -f1-3
I would suggest awk in this situation:
awk '{print $1,$2,$3}' ./infile
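A small check of both forms, plus a note (mine): with fewer than three fields, awk prints trailing separators for the missing fields, while cut passes a line through unchanged if it contains no delimiter at all.

```shell
line='the quick brown fox jumps'
by_awk=$(printf '%s\n' "$line" | awk '{print $1, $2, $3}')
by_cut=$(printf '%s\n' "$line" | cut -d' ' -f1-3)
echo "$by_awk"
echo "$by_cut"
```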
With -E for OS X compatibility (other Unix systems may or may not need it):
% sed -E 's/(([^ ]+ ){3}).*/\1/' <<END
the quick brown fox jumps over the lazy bear
the blue lion is hungry
END
the quick brown
the blue lion
Just using the shell
while read -r a b c d    # d soaks up the rest of the line
do
    echo "$a $b $c"
done < file
Ruby(1.9)+
ruby -ane 'print "#{$F[0]} #{$F[1]} #{$F[2]}\n"' file
If you need a sed script, you can try:
echo "the quick brown fox jumps over the lazy bear" | sed 's/^\([a-zA-Z]\+\ [a-zA-Z]\+\ [a-zA-Z]\+\).*/\1/'
But I think it would be easier using cut:
echo "the quick brown fox jumps over the lazy bear" | cut -d' ' -f1,2,3
Here's an ugly one with sed:
$ echo the quick brown fox jumps over the lazy bear | sed 's|^\(\([^[:space:]]\+[[:space:]]\+\)\{2\}[^[:space:]]\+\).*|\1|'
the quick brown
If Perl is an option:
perl -lane 'print "$F[0] $F[1] $F[2]"' file
or
perl -lane 'print join " ", @F[0..2]' file