sed script to print the first three words in each line - scripting

I wonder how I can do the following thing with sed:
I need to keep only the first three words in each line.
For example, the following text:
the quick brown fox jumps over the lazy bear
the blue lion is hungry
will be transformed into:
the quick brown
the blue lion

In awk you can say:
{print $1, $2, $3}

You can use cut like this:
cut -d' ' -f1-3

I would suggest awk in this situation:
awk '{print $1,$2,$3}' ./infile
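The cut and awk answers above behave differently when words are separated by more than one space. A minimal sketch, assuming a made-up sample line with a doubled space (the variable names are only for illustration):

```shell
# Hypothetical sample line with two spaces between the first two words.
line='the  quick brown fox'

# cut treats every single space as a delimiter, so the doubled space
# produces an empty second field.
cut_out=$(printf '%s\n' "$line" | cut -d' ' -f1-3)

# awk splits on runs of blanks by default, so it still sees three words.
awk_out=$(printf '%s\n' "$line" | awk '{print $1, $2, $3}')

printf 'cut: [%s]\n' "$cut_out"
printf 'awk: [%s]\n' "$awk_out"
```

Here cut returns only two words (with the doubled space preserved), while awk returns three, which is one reason to prefer awk when the spacing of the input is not guaranteed.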

% (echo "A B C D E F G H";echo "a b c d e f g h") | sed -E 's/([^\s].){3}//'
I put the "-E" in there for OS X compatibility. Other Unix systems may or may not need it.
edit: damnitall - brainfart. use this:
% sed -E 's/(([^ ]+ ){3}).*/\1/' <<END
the quick brown fox jumps over the lazy bear
the blue lion is hungry
END
the quick brown
the blue lion

Just using the shell
while read -r a b c d
do
echo "$a $b $c"
done < file
Ruby(1.9)+
ruby -ane 'print "#{$F[0]} #{$F[1]} #{$F[2]}\n"' file

If you need a sed script, you can try:
echo "the quick brown fox jumps over the lazy bear" | sed 's/^\([a-zA-Z]\+\ [a-zA-Z]\+\ [a-zA-Z]\+\).*/\1/'
But I think it would be easier using cut:
echo "the quick brown fox jumps over the lazy bear" | cut -d' ' -f1,2,3

Here's an ugly one with sed:
$ echo the quick brown fox jumps over the lazy bear | sed 's|^\(\([^[:space:]]\+[[:space:]]\+\)\{2\}[^[:space:]]\+\).*|\1|'
the quick brown
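The same idea may be easier to read with extended regular expressions. This is only a sketch: it assumes a sed that accepts -E (GNU and BSD sed do), and the function name first_three is made up for the example.

```shell
# Keep the first three whitespace-separated words of each line.
# Leading whitespace is dropped; lines with fewer than three words
# do not match the pattern and pass through unchanged.
first_three() {
  sed -E 's/^[[:space:]]*(([^[:space:]]+[[:space:]]+){2}[^[:space:]]+).*/\1/'
}

out=$(printf '%s\n' \
  'the quick brown fox jumps over the lazy bear' \
  'the blue lion is hungry' \
  'too short' | first_three)

printf '%s\n' "$out"
```

Unlike the variant that repeats "word plus space" three times, this one leaves no trailing space, because the third word is matched without its following separator.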

If Perl is an option:
perl -lane 'print "$F[0] $F[1] $F[2]"' file
or
perl -lane 'print join " ", @F[0..2]' file

Related

No difference in output running 'cut -f 1 sample.txt'

echo 'The quick brown; fox jumps over the lazy dog' > sample.txt
I then run
cut -f 1 sample.txt
or
cut -f 2 sample.txt
and my output is always the same,
The quick brown; fox jumps over the lazy dog
Shouldn't the output of the first 'cut' command be 'dog'? Why is the output the same if I run each 'cut' command?
The default separator is a tab. If you want a space instead, set it with -d (delimiter):
cut -f 1 -d ' ' sample.txt
The
https://en.wikibooks.org/wiki/Cut
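To see the effect of the default tab delimiter, a small sketch along these lines may help. The tab-separated sample 'one\ttwo' is made up for the example; -s is the standard cut option that suppresses lines containing no delimiter.

```shell
line='The quick brown; fox jumps over the lazy dog'

# No tab in the line, so cut prints the whole line unchanged.
no_delim=$(printf '%s\n' "$line" | cut -f 1)

# With -s, lines that contain no delimiter are suppressed instead.
suppressed=$(printf '%s\n' "$line" | cut -s -f 1)

# With a space as delimiter, field 1 is the first word.
with_space=$(printf '%s\n' "$line" | cut -d ' ' -f 1)

# With real tab-separated input the default delimiter works as expected.
tabbed=$(printf 'one\ttwo\n' | cut -f 2)

printf '[%s]\n' "$no_delim" "$suppressed" "$with_space" "$tabbed"
```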

Awk to print a single next word following a pattern match

This Q is a variation on the theme of printing something after a pattern.
There will be input lines with words. Some lines will match a pattern where the pattern will be one or multiple words separated by space. The pattern might have a leading/trailing space which needs to be obeyed. I need to print the word immediately following the match.
Example input
The quick brown fox jumps over the lazy dog
Pattern : "brown fox "
Desired output : jumps
The pattern will only occur once in the line. There will always be a word following the pattern. There will be lines without the pattern.
awk or sed would be nice.
Cheers.
EDIT :
I failed to ask the question properly. There will be one or more spaces between the pattern and the next word. This breaks Andre's proposal.
% echo -e "The quick brown fox jumps over the lazy dog\n" | awk -F 'brown fox ' 'NF>1{ sub(/ .*/,"",$NF); print $NF }'
jumps
This works, given that the desired word is followed by a space:
$ echo -e "The quick brown fox jumps over the lazy dog\n" > file
$ awk -F 'brown fox ' 'NF>1{ sub(/ .*/,"",$NF); print $NF }' file
jumps
Edit:
If there are more spaces, use this:
$ awk -F 'brown fox' 'NF>1{ sub(/^ */,"",$NF);
sub(/ .*/,"",$NF); print $NF }' file
Disclaimer: this solution assumes that if no pattern is found (There will be lines without the pattern.) it is appropriate to print an empty line. If that does not hold true, ignore this answer entirely.
I would use AWK for this in the following way. Let the content of file.txt be
The quick brown fox jumps over the lazy dog
No animals in this line
The quick brown fox jumps over the lazy dog
then
awk 'BEGIN{FS="brown fox *"}{sub(/ .*/,"",$2);print $2}' file.txt
output
jumps
jumps
Explanation: I set the field separator FS to "brown fox" followed by any number of spaces. Whatever comes after it appears in the 2nd field. I remove from the 2nd field everything from the first space onward, including that space, then print the field. If there is no match, the second field is empty and these actions result in an empty line.
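If the empty lines for non-matching input are not wanted, the field-separator approach can be combined with an NF > 1 guard. A minimal sketch, assuming made-up sample input with several spaces after the pattern:

```shell
input='The quick brown fox   jumps over the lazy dog
No animals in this line'

# FS is a regex: "brown fox" followed by one or more spaces.
# NF > 1 is true only when the separator occurred in the line,
# so lines without the pattern print nothing.
out=$(printf '%s\n' "$input" |
  awk -F 'brown fox +' 'NF > 1 { sub(/ .*/, "", $2); print $2 }')

printf '%s\n' "$out"
```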
With GNU grep:
$ grep -oP '(?<=brown fox )(\w+)' file
jumps
If you have more than 1 space after the match:
$ echo 'The quick brown fox jumps over the lazy dog' | grep -oP '(?<=\bbrown fox\b)\s+\K(\w+)'
jumps
Perl, with the same regex:
$ perl -lnE 'print $1 if /(?<=\bbrown fox )(\w+)/' file
Or, if you have multiple spaces:
$ perl -lnE 'print $1 if /(?<=brown fox)\s+(\w+)/' file
(As stated in comments, both the GNU grep and Perl regex could be \bbrown\h+fox\h+\K\w+ which has the advantage of supporting multiple spaces between brown and fox)
With awk, you can split on the string and split the result (this works as-is for multi spaces):
pat='brown fox'
awk -v pat="$pat" 'index($0, pat){
split($0,arr, pat)
split(arr[2], arr2)
print arr2[1]}' file
With GNU awk, you might also use a capture group with function match.
\ybrown\s+fox\s+(\w+)
\y A word boundary
brown\s+ Match brown and 1+ whitespace chars
fox\s+ Match fox and 1+ whitespace chars
(\w+) Capture 1+ word chars in group 1
In awk, get the group 1 value using arr[1]
Example
echo "The quick brown fox jumps over the lazy dog" |
awk 'match($0,/\ybrown\s+fox\s+(\w+)/, arr) {print arr[1]}'
Output
jumps
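The three-argument match() is a GNU awk extension. For other awks, a sketch along the following lines should give the same result using only RSTART and RLENGTH. It assumes words are separated by spaces or tabs and does not emulate the \y word boundary; the function name next_word is made up for the example.

```shell
# Print the word that follows "brown fox", allowing runs of blanks
# between and after the two pattern words.
next_word() {
  awk '{
    if (match($0, /brown[ \t]+fox[ \t]+/)) {
      # Everything after the matched pattern and its trailing blanks.
      rest = substr($0, RSTART + RLENGTH)
      # The next word is the leading run of non-blank characters.
      if (match(rest, /^[^ \t]+/))
        print substr(rest, 1, RLENGTH)
    }
  }'
}

out=$(printf '%s\n' \
  'The quick brown  fox   jumps over the lazy dog' \
  'No animals in this line' | next_word)

printf '%s\n' "$out"
```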

How to remove all lines after a line containing some string?

I need to remove all lines after the first line that contains the word "fox".
For example, for the following input file "in.txt":
The quick
brown fox jumps
over
the
lazy dog
the second fox
is hunting
The result will be:
The quick
brown fox jumps
I prefer a script made in awk or sed but any other command line tools are good, like perl or php or python etc.
I am using gnuwin32 tools in Windows, and the solution I could find was this one:
grep -m1 -n fox in.txt | cut -f1 -d":" > var.txt
set /p MyNumber=<var.txt
head -%MyNumber% in.txt > out.txt
However, I am looking for a solution that is shorter and that is portable (this one contains Windows specific command set /p).
Similar questions:
How to delete all the lines after the last occurence of pattern?
How to delete lines before a match perserving it?
Remove all lines before a match with sed
How to delete the lines starting from the 1st line till line before encountering the pattern '[ERROR] -17-12-2015' using sed?
How to delete all lines before the first and after the last occurrence of a string?
awk '{print} /fox/{exit}' file
With GNU sed:
sed '0,/fox/!d' file
or
sed -n '0,/fox/p' file
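The 0,/fox/ address form is GNU-specific. A sketch of an alternative that should work with any POSIX sed uses the q command, which prints the current line and then quits:

```shell
input='The quick
brown fox jumps
over
the
lazy dog
the second fox
is hunting'

# Print lines until the first one containing "fox", then stop reading.
out=$(printf '%s\n' "$input" | sed '/fox/q')

printf '%s\n' "$out"
```

Because sed exits at the first match, the rest of the input is never read, which also helps with large files.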

Print the duplicate lines in a file using awk

I need to print all the duplicated lines in a file, but the uniq on my system does not support the -D option. So I am looking for an alternative way to print the duplicate lines using awk. I know that awk has an idiom like the one below.
testfile.txt
apple
apple
orange
orange
cherry
cherry
kiwi
strawberry
strawberry
papaya
cashew
cashew
pista
The command:
awk 'seen[$0]++' testfile.txt
But the above prints only the second and later occurrences of each line, so each pair shows up once. I need the same output that the uniq -D command produces, like this:
apple
apple
orange
orange
cherry
cherry
strawberry
strawberry
cashew
cashew
No need to parse the file twice:
$ awk 'c[$0]++; c[$0]==2' file
apple
apple
orange
orange
cherry
cherry
strawberry
strawberry
cashew
cashew
If you want to stick with just plain awk, you'll have to process the file twice: once to generate the counts, once to eliminate the lines with count equal 1:
awk 'NR==FNR {count[$0]++; next} count[$0]>1' testfile.txt testfile.txt
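The two-pass form needs a real file, since a pipe cannot be read twice. If the input comes from a pipe and fits in memory, one possible sketch is to buffer the lines and print in an END block. The sample input below is made up and has non-contiguous duplicates.

```shell
input='apple
kiwi
orange
apple
orange
pista'

# Store every line and count occurrences; after the last line,
# replay the stored lines in their original order and keep those
# seen more than once.
out=$(printf '%s\n' "$input" | awk '
  { line[NR] = $0; count[$0]++ }
  END {
    for (i = 1; i <= NR; i++)
      if (count[line[i]] > 1) print line[i]
  }')

printf '%s\n' "$out"
```

The trade-off is memory: the whole input is held in the line array, whereas the two-pass version only stores the counts.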
With sed:
$ sed 'N;/^\(.*\)\n\1$/p;$d;D' testfile.txt
apple
apple
orange
orange
cherry
cherry
strawberry
strawberry
cashew
cashew
This does the following:
N # Append next line to pattern space
/^\(.*\)\n\1$/p # Print if lines in pattern space are identical
$d # Avoid printing lone non-duplicate last line
D # Delete first line in pattern space
There are a few limitations:
It only works for contiguous duplicates, i.e., not for
apple
orange
apple
Lines appearing more than twice in a row throw it off.
This might work for you (GNU sed):
sed -rn ':a;N;/^([^\n]*)\n\1$/p;//ba;/^([^\n]*)(\n\1)+$/P;//ba;s/.*\n//;ba' file
Read two lines into the pattern space (PS). If the first two lines are duplicate, print them and loop back and read a third line. If the third or subsequent lines are duplicate, print the first and loop back and read another line. Otherwise, remove all but the last line and loop back and read another etc.
Something like this, if uniq supports -d?
grep -f <(uniq -d testfile.txt ) testfile.txt
awk '{if (x[$1]) { x_count[$1]++; print $0; if (x_count[$1] == 1) { print x[$1] } } x[$1] = $0}' testfile.txt
You can do:
$ uniq -d file | awk '1;1'

In a file with two word lines, grep only those lines which have both words from a whitelist

I have a file1:
green
yellow
apple
mango
and a file2:
red apple
blue banana
yellow mango
purple cabbage
I need to find elements from file2 where both words belong to the list in file1. So it should show:
yellow mango
I tried:
awk < file2 '{if [grep -q $1 file1] && [grep -q $2 file1]; then print $0; fi}'
I am getting syntax error.
This will do the trick:
$ awk 'NR==FNR{a[$0];next}($1 in a)&&($2 in a)' file1 file2
yellow mango
Explanation:
NR is a special awk variable that tracks the current line number across all input, and FNR tracks the current line number within each individual file, so the condition NR==FNR is only true while we are reading the first file. a is an associative array whose keys are the unique lines of the first file. $0 is the current line. The next statement skips to the next input line, so the rest of the script is not executed for the first file. The final part is straightforward: if the first field $1 and the second field $2 are both keys in the array a, the current line is printed. The default block in awk is {print $0}, so the print is implicit.
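To watch NR and FNR diverge, a small sketch like the following can be run. It recreates file1 and file2 from the question in a temporary directory (the tmpdir variable and the array name allowed are only for illustration).

```shell
tmpdir=$(mktemp -d)
printf '%s\n' green yellow apple mango > "$tmpdir/file1"
printf '%s\n' 'red apple' 'blue banana' 'yellow mango' 'purple cabbage' > "$tmpdir/file2"

# NR keeps counting across files; FNR restarts at 1 for file2.
trace=$(awk '{ print NR, FNR, $0 }' "$tmpdir/file1" "$tmpdir/file2")

# Same logic as the answer, with a more descriptive array name.
matches=$(awk 'NR == FNR { allowed[$0]; next }
               ($1 in allowed) && ($2 in allowed)' "$tmpdir/file1" "$tmpdir/file2")

printf '%s\n' "$trace" "$matches"
rm -rf "$tmpdir"
```

One caveat of the idiom: if the first file is empty, NR==FNR is also true for the second file, so nothing is printed.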
This is a very hackish approach and probably frowned upon by many of the grep/sed implementors. In addition it is probably terminal dependent. You have been warned.
GNU grep, when in color mode, highlights the pieces of the input that were matched by one of the patterns. In theory, this could be used as a test for a full match. Here it even works in practice, with some help from GNU sed:
grep --color=always -f file1 file2 | sed -n '/^\x1b.*\x1b\[K *\x1b.*\x1b\[K$/ { s/\x1b\[K//g; s/\x1b[^m]*m//gp }'
Output:
yellow mango
Note that the sed pattern assumes space separated columns in file2.
You can do it with bash, sed and grep:
grep -f <(sed 's/^/^/' file1) file2 | grep -f <(sed 's/$/$/' file1)
this is a bit obscure, so I will break it down:
grep -f <file> reads a sequence of patterns from a file and will match on any of them.
<(...) is bash process substitution and will execute a shell command and create a pseudo-file with the output that can be used in place of a filename.
sed 's/^/^/' file1 inserts a ^ character at the start of each line in file1, turning the lines into patterns that will match the first word of file2.
sed 's/$/$/' file1 inserts a $ character at the end, so the patterns will match the second word.
Edit:
Use:
grep -f <(sed 's/^/^/;s/$/\b/' file1) file2 | grep -f <(sed 's/$/$/;s/^/\b/' file1)
to get round the issue that Jonathan pointed out in his comment.
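A variant without process substitution or \b, for shells and greps that lack them, could anchor each pattern on the separating space instead. This is a sketch under two assumptions: the two words in file2 are separated by exactly one space, and the words in file1 contain no regex metacharacters. The sample file2 lines are made up to include partial matches.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' green yellow apple mango > "$tmpdir/file1"
printf '%s\n' 'red apple' 'yellow mango' 'yellowish mango' 'yellow mangosteen' > "$tmpdir/file2"

# Build patterns that require the whole first (or second) word to match.
sed 's/.*/^& /' "$tmpdir/file1" > "$tmpdir/first"    # e.g. "^yellow "
sed 's/.*/ &$/' "$tmpdir/file1" > "$tmpdir/second"   # e.g. " mango$"

# Keep lines whose first word is whitelisted, then those whose
# second word is whitelisted too.
both=$(grep -f "$tmpdir/first" "$tmpdir/file2" | grep -f "$tmpdir/second")

printf '%s\n' "$both"
rm -rf "$tmpdir"
```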